Abstract

This thesis examines the problem of learning unknown target functions from examples. 
In particular, we focus on the informational complexity of learning such functions, i.e., the
number of examples needed in order to identify the target with high accuracy and great 
confidence. There are a number of factors affecting the informational complexity, and we 
attempt to tease them apart in different settings, some of which are cognitively relevant. 

1) We consider a wide class of pattern classification and regression schemes known 
as regularization networks. We investigate the number of parameters and the number of 
examples that we need in order to achieve a certain generalization error with prescribed 
confidence. We show that the generalization error is due in part to the representational
inadequacy (finite number of parameters) and informational inadequacy (finite number of
examples), and bound each of these two contributions. In doing so, we characterize a) the
inherent tension between these two forms of error: attempting to reduce one increases the
other; b) the class of problems effectively solved by regularization networks; and c) how to
choose an appropriately sized network for such a class of problems.

2) Rather than drawing its examples randomly (passively), suppose a learner were al- 
lowed to choose its own examples. Does this option allow us to reduce the number of 
examples? We derive a sequential version of optimal recovery allowing the active learner 
to adaptively choose points of maximum information. We compare this against the passive 
case, and classical optimal recovery, indicating superior performance. 

3) We investigate the problem of language learning within the principles and parameters 
framework. We show how certain memoryless algorithms operating on finite parameter 
spaces can be effectively modeled as a Markov chain. This allows us to characterize the 
learnability, and sample complexity of such linguistic spaces. 

4) We consider a population of learners attempting to learn a target language using some 
learning algorithm. We derive a dynamical system model (from the grammatical theory 
and learning paradigm) characterizing the evolving linguistic composition of the population 
over many generations. We examine the computational and linguistic consequences of this 
derivation, and show that it allows us to formally pose an evolutionary criterion for the 
adequacy of linguistic theories.

Chapter 1



Introduction 



Abstract 

We introduce the framework in which learning from examples is to be studied. We develop a precise 
notion of informational complexity and discuss the factors upon which this depends. Finally, we 
provide an outline of the four problems discussed in this thesis, our major contributions, and their 
implications. 



Learning is the centerpiece of human intelligence. Consequently, any attempt to un-
derstand intelligence in the human being or to replicate it in a machine (as the field
of artificial intelligence is committed to doing) must of necessity explain this remark-
able ability. Indeed, a significant amount of effort and initiative has gone into this
enterprise and a collective wisdom has emerged regarding the paradigms in which this 
study is to be conducted. 

Needless to say, learning can mean a variety of things. The ability to learn a 
language, to recognize objects, to manipulate them and navigate through them, to 
learn to play chess or to learn the theorems of geometry all touch upon different sectors 
of this multifaceted activity. They require different skills, operate on different spaces 
and use different procedures. This has naturally led to a spate of learning paradigms,
but most share one thing in common: learning, as opposed to "preprogrammed"
or memorized behavior, involves the updating of hypotheses on the basis of some kind
of experience, an adaptation, if you will, to the environment on the basis of stimuli
from it. The connection to complex adaptive systems springs to mind and later in
this thesis we will make this connection more explicit in a specific context. 

How then does one begin to study such a multifaceted problem? In order to 
meaningfully define the scope of our investigations, let us begin by considering a 
formulation by Osherson et al. (1986). They believe (as do we) that learning typically
involves 

1. A learner

2. A thing to be learned.

3. An environment in which the thing to be learned is presented to the learner. 

4. The hypotheses that occur to the learner about the thing to be learned on the 
basis of the environment. 

Language acquisition by children is a classic example which fits well into this 
framework. "Children are the learners; a natural language is the thing to be learned; 
the corpus of sentences available to the child is the relevant environment; grammars 
serve as hypotheses." (from Systems that Learn; Osherson et al., 1986). In contrast,
consider an example from machine learning: the task of object recognition by the
computer. Here the computer (or the corresponding algorithm) is the learner, the
identities of objects (like chairs or tables, for example) are the things to be learned,
examples of these objects in the form of images are the relevant environment, and 
the hypotheses might be decision boundaries which can be computed by a neural 
network. 

In this thesis we will concern ourselves with learning input-output mappings from 
examples of these mappings; in other words, learning target functions which are as- 
sumed to belong to some class of functions. The view of the brain as an information 
processor (see Marr, 1982) suggests that in solving certain problems (like object recog- 
nition, for example) the brain develops a series of internal representations starting 
with the sensory (external) input; in other words, it computes a function. In some
cases, this function is hardwired (like detecting the orientations of edges in an image,
for example); in others the function is learned, as in learning to recognize individual
faces. 1 As another example of an input-output function the brain has to compute,
consider the problem of speech recognition. The listener is provided with an acoustic
signal which corresponds to some underlying sentence, i.e., a sequence of phonetic
categories (or something quite like them). Clearly the listener is able to uncover the
transformation from this acoustic space to the lexical space. Note also that this 
transformation appears to be different for different languages, i.e., different languages 
have different inventories of phonetic symbols. Further, they carve up the acoustic 
space in different ways; this accounts for why the same acoustic stimuli might be 
perceived differently as belonging to different phonetic categories by a native speaker
of Bengali and a native speaker of English. Since children are not genetically predis-
posed to learn Bengali as opposed to English (or vice versa) one might conclude that
the precise nature of this transformation is learned.

1 Functions mapping images of faces to the identity of the person possessing them may of course
themselves be composed of more primitive functions, like edge detectors, which are hardwired. There
is a considerable body of literature devoted to identifying the hardwired and learned components
of this entire process from a neurobiological perspective. The purpose of this example was merely
to observe that the brain appears to learn functions of various kinds; consequently studying the
complexity of learning functions is of some value.

Not all the functions we consider in this thesis can be psychologically well-motivated;
while some chapters of this thesis deal with languages and grammars which are linguis-
tically well motivated, Chapter 2, which concentrates in large part on Sobolev spaces,
hardly seems psychologically interesting. However, the central strand run-
ning through this thesis is the informational complexity of learning from examples.
In other words, if information is provided to the learner about the target function 
in some fashion, how much information is needed for the learner to learn the target 
well? In the task of learning from examples (examples, as we shall see later, are often
really nothing more than $(x, y = f(x))$ pairs where $(x, y) \in X \times Y$ and $f : X \to Y$),
how many examples does the learner need to see? This same question is asked of
strikingly different classes of functions: Sobolev spaces and context free languages.
Certain broad patterns emerge. Clearly the number of examples depends upon the
algorithm used by the learner to choose its hypotheses, the complexity of the class 
from which these hypotheses are chosen, the amount and type of noise and so on. 
We will try in this thesis to tease apart the relative contributions of each in specific 
settings in order to uncover fundamental constraints and relationships between oracle 
and learner; constraints which have to be obeyed by nature and human in the process 
of living. 2 

This then is our point of view. Let us now discuss some of the relevant issues in 
turn, briefly evaluate their importance in a learning paradigm, and the conceptual 
role they have to play in this thesis. 

1.1 The Components of a Learning Paradigm 

1.1.1 Concepts, Hypotheses, and Learners 

Concept Classes 

We need to define the "things" to be learned. In order to do this, we typically assume 
the existence of identifiable entities (concepts) which are to be learned and which 
belong perhaps to some set or class of entities (the concept class). Notationally, we 
can refer to the concept class by $C$, which is a set of concepts $c \in C$. These concepts
need to be described somehow and various representation schemes can be used. For
example, researchers have investigated concept classes which can be expressed as
predicates in some logical system (Michalski, Carbonell, and Mitchell; 1986). For our
purposes we concentrate on classes of functions, i.e., our concept classes are collections
of functions from X to Y where X and Y are sets. We will define the specific nature
of these functions over the course of this thesis.

2 Even if we are totally unconcerned with human learning and are interested only in designing
machines or algorithms which can learn functions from examples, a hotly pursued subject in machine
learning, the issue of number of examples is obviously of considerable importance.

Information Sources 

Information is presented to the learner about a target concept $c \in C$ in some fashion.
There is a huge space of possibilities ranging from a "divine" oracle simply enlight-
ening the learner with the true target concept in one fell swoop to adversarial oracles
which provide information in a miserly, deliberately malicious fashion. We have al-
ready restricted our inquiry to studying the acquisition of function classes. A natural
and well studied form of information transmission is to allow the learner access to an
oracle which provides (x, y) pairs or "labelled examples" perhaps tinged with noise.
In a variant of the face recognition problem where one is required to identify the
gender of some unknown person (Brunelli and Poggio, 1992), for example, labelled
examples might simply be (image, gender) pairs. On the basis of these examples then,
the learner attempts to infer the target function. 

We consider several variants of this theme. For example, in Chapter 2, we allow the
learner access to (x, y) pairs drawn according to a fixed unknown arbitrary probability
distribution on some space $X \times Y$. This represents a passive learner who is at the
mercy of the unknown probability distribution, which could, in principle provide 
unrepresentative data with high probability. In Chapter 3 we explore the possibility 
of reconstructing functions by allowing the learner to choose his or her own examples, 
i.e., an active collector rather than a passive recipient of examples. This is studied in 
the context of trying to learn functional mappings of various sorts. Mathematically, 
there are connections to adaptive approximation, a somewhat poorly studied problem. 
Active learning (as we choose to call it) is inspired by various strategies of selective 
attention that the human brain develops to solve some cognitive tasks. In Chapters 
4 and 5 which concentrate on learning the class of natural languages, the examples 
are sentences spoken by speakers of the target language. We assume again that 
these sentences are spoken according to a probability distribution on all the possible 
sentences; there are two further twists: 1) no negative examples occur and 2) typically 
a bound on the length of the sentences is observed. In all these cases, the underlying 
question of interest is: given the scheme of presenting examples to the learner, how 
many examples does the learner need to see to learn well? This question will be
sharpened as we progress.

The Learner and Its Hypotheses 

The learner operates with a set of hypotheses about reality. As information is pre-
sented to it, it updates its hypothesis, or chooses 3 among a set of alternate hypotheses
on the basis of the experience (evidence or data, depending upon your paradigm of think-
ing). Clearly then, the learner is mapping its data onto a "best" hypothesis which it
chooses in some sense from a set of hypotheses (which we can now call the hypothesis
class, $\mathcal{H}$). This broad principle has found instantiations in many differing forms in
diverse disciplines. 

Consider an example chosen from the world of finance. A stockbroker might wish
to invest a certain amount of money in stock. Given the variation of share values over
the past few years (a time series) and given his or her knowledge or understanding
of the way the market and its players operate, he or she might choose to invest in a
particular company. As the market and the share prices unfold, he or she might vary
the investments (buying and selling stock), thereby updating the hypotheses. Cumulative
experience then might "teach" him or her (or in other words, he or she might "learn") to
play this game well. 

Or consider another mini-example from speech recognition (specifically phonetic 
recognition) mapping data to hypotheses. Among other things, the human learner 
has to discriminate between the sounds /s/ and /sh/. He or she learns to do
this by being exposed to examples (instances) of each phoneme. Over the course of 
time, after exposure to several examples, the learner develops a perceptual decision 
boundary to separate /s/ sounds from /sh/ sounds in the acoustic domain. Such 
a decision boundary is clearly learned; it marginally differs from person to person 
as evidenced by differing responses humans might have when asked to classify a 
particular sound into one of the two categories. This decision boundary, $h$, can be
considered to be the learner's hypothesis of the s/sh distinction (which he or she
might in principle pick from a class of possible decision boundaries $\mathcal{H}$ on the basis of
the data). 

As a matter of fact, the scientific enterprise itself consists of the development of 
hypotheses about underlying reality. These hypotheses are developed by observing 
patterns in the physical world and represented as models, schema or theories which 
describe these patterns concisely. 



3 In artificial intelligence, this task of "searching" the hypothesis space has been given a lot of 
attention resulting in a profusion of searching heuristics and characterizations of the computational 
difficulty of this problem. In this thesis, we ignore this issue for the most part.

If indeed the learner is performing the task of mapping data to hypotheses, it
becomes of interest to study the space of algorithms which can perform this task. 
Needless to say, the operating assumption is that the human learner is also following 
some algorithm; insights from biology or psychology might help the computer scientist 
to narrow the space of algorithms and a biologically plausible computational theory 
(Marr, 1982) might emerge. For our purposes then the learner is an algorithm (or a 
partial recursive function) from data sets to hypothesis classes. 

There is a further important connection between concepts and hypotheses which 
should be highlighted here. In our scheme of things, concepts are assumed to be 
the underlying reality; hypotheses are models of this reality. Clearly for successful 
learning (we discuss learnability in the next section) to occur, the elements of $\mathcal{H}$ should
be able to approximate the elements of $C$; in other words, $\mathcal{H}$ should have sufficient
power or complexity to express $C$. For learnability in the limit (Gold, 1967) or PAC-
style (Probably Approximately Correct; Valiant, 1984) models for learnability, this
notion can be made more precise. For example, if $C$ is some class of real valued
functions, $\mathcal{H}$ should probably be dense in $C$.

1.1.2 Generalization, Learnability, Successful learning 

In addition to the four points noted earlier, another crucial component of learning 
is a criterion for success. Formally speaking, one needs to define a metric on the 
space of hypotheses in order to measure the distance between differing hypotheses, as 
also between the target concept and the learner's hypothesis. It is only when such a 
metric is imposed, that one can meaningfully decide whether a learner has "learned" 
the target concept. There are a number of related notions which might be worthwhile 
to introduce here. 

First, there is the issue of generalization. It can be argued, that a key component 
of learning is not just the development of hypotheses on the basis of finite experience 
(as experience must be), but the use of those hypotheses to generalize to unseen ex- 
perience. Clearly successful generalization necessitates the closeness (in some sense) 
of the learner's hypothesis and the target concept, for it is only then that unseen data 
(consistent with the target concept) can be successfully modeled by the learner's hy- 
pothesis. Thus successful learning would involve successful generalization; this thesis 
deals with the informational complexity of successful generalization. The learnability 
of concepts implies the existence of algorithms (learners) which can develop hypothe- 
ses which would eventually converge to the target. This convergence "in the limit" is 
analogous to the notion of consistency in statistical estimators and was introduced to 
the learning community by Gold (1967) and remains popular to this day as a criterion
for language learning.

In our case, when learning function classes, $\mathcal{H}$ and $C$ contain functions from some
space $X$ to some space $Y$, and examples are (x, y) pairs consistent with some target func-
tion $c \in C$. Let the learner's hypothesis after $m$ such examples be $h_m \in \mathcal{H}$. According
to some pre-decided criterion, we can put a distance metric $d$ on the space of functions
to measure the distance between concept and hypothesis; this is our generalization
error, $d(h_m, c)$. Learnability in the limit would require $d(h_m, c)$ to go to zero as the
number of examples, m, goes to infinity. The sense in which this convergence occurs 
might depend upon several other assumptions; one might require this convergence to 
hold for every learning sequence, i.e., for every sequence of examples, or one might 
want this to be satisfied for almost every sequence in which case one needs to assume 
some kind of measure on the space according to which one might get convergence in 
measure (probability). 

Convergence in the limit measures only the asymptotic behavior of learning algo-
rithms; it does not characterize behavior with finite data sets. To correct
for this, one must characterize the rates of the above-mentioned convergence:
roughly speaking, how many examples does the learner need to collect so that the gen-
eralization error will be small? Again depending upon individual assumptions, there
are several ways to formally pose this question. The most popular approach has been 
to provide a probabilistic formulation; Valiant (1984) does this in his PAC model 
which has come to play an increasingly important role in computational learning the-
ory. In PAC learning, one typically assumes that examples are drawn according to
some unknown probability distribution on $X \times Y$ and presented to the learner. If
there exists an algorithm $A$ which computes hypotheses from data such that for every
$\epsilon > 0$ and $0 < \delta < 1$, $A$ collects $m(\epsilon, \delta)$ examples and outputs a hypothesis $h_m$ satis-
fying $d(h_m, c) < \epsilon$ with probability greater than $1 - \delta$, then the algorithm is said to
PAC-learn the concept $c$. If the algorithm can PAC-learn every concept in $C$ then the
concept class is said to be PAC-learnable. Looking closely, it can be realized that PAC
learnability is essentially the same as weak convergence in probability of hypotheses
(estimators) to their target functions with polynomial rates of convergence. In any
case, PAC-like formulations play a powerful role in characterizing the informational
complexity of learning; we have a great intellectual debt to this body of literature 
and its influence in this thesis cannot be overemphasized. 
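
To make the $(\epsilon, \delta)$ criterion concrete, here is a minimal simulation sketch. Everything in it is an illustrative assumption rather than part of the thesis: the concept class is thresholds on [0, 1], examples are drawn uniformly, the learner returns the smallest positively labelled point, and the names `learn_threshold`, `sample_size`, and `estimate_confidence` are hypothetical. In this toy setting the generalization error exceeds $\epsilon$ with probability at most $(1 - \epsilon)^m$, so $m \ge \ln(1/\delta)/\epsilon$ examples suffice.

```python
import math
import random


def learn_threshold(examples):
    """Return the smallest positively labelled x (a hypothesis consistent with the data)."""
    positives = [x for x, y in examples if y == 1]
    return min(positives) if positives else 1.0


def sample_size(epsilon, delta):
    """Number of examples sufficient for this toy class: m >= ln(1/delta) / epsilon."""
    return math.ceil(math.log(1.0 / delta) / epsilon)


def pac_trial(target, m, rng):
    """Draw m uniform examples labelled by c(x) = 1 iff x >= target; return d(h_m, c)."""
    examples = []
    for _ in range(m):
        x = rng.random()
        examples.append((x, 1 if x >= target else 0))
    h = learn_threshold(examples)
    # Under the uniform distribution, two thresholds disagree exactly on the
    # interval between them, so the error is the length of that interval.
    return abs(h - target)


def estimate_confidence(target, m, epsilon, trials=2000, seed=0):
    """Fraction of independent trials in which d(h_m, c) < epsilon."""
    rng = random.Random(seed)
    hits = sum(pac_trial(target, m, rng) < epsilon for _ in range(trials))
    return hits / trials


if __name__ == "__main__":
    m = sample_size(epsilon=0.1, delta=0.05)
    print(m, estimate_confidence(target=0.3, m=m, epsilon=0.1))
```

Running the script prints the sample size and the empirical fraction of trials meeting the accuracy requirement; that fraction can be compared against the desired confidence $1 - \delta$.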

Remark. Sometimes, an obsession with proving the convergence of learning algorithms
might be counterproductive. A very good example considered in this thesis is
the problem of language learning and language change. We need to be able to explain
how children learn the language of their social environment on the basis of example
sentences. In particular, researchers have postulated algorithms by means of which
they can do this; considerable effort has gone into showing that these algorithms suc- 
cessfully converge to the target. However, this does not explain the simultaneously 
confounding fact that languages change with time. If generation after generation, 
children successfully converge to the language of their parental generation, then lan- 
guages would never change. The challenge lies in constructing learning paradigms 
which can explain both. In this thesis, we do so by starting out with a model for
language learning and deriving from it a model for language change. The lan-
guage change model is a dynamical system characterizing the historical evolution of
linguistic systems; a formalization of ideas in Lightfoot (1991) and Gell-Mann (1989). 

1.1.3 Informational Complexity 

We have discussed how the learner chooses hypotheses from $\mathcal{H}$ on the basis of data
and how one needs to measure the relative "goodness" of each hypothesis to set a 
precise criterion for learning. We have also introduced the spirit of the Gold and 
Valiant formulations of learning and their relationship to the issues of the number of 
examples and successful generalization. We pause now to comment on some other 
aspects of this relationship. 

First, note that for a particular concept $c \in C$, given a distance metric $d$, there
exists a best hypothesis in $\mathcal{H}$ given by

$$h_\infty = \arg\min_{h \in \mathcal{H}} d(c, h)$$

Clearly, if $\mathcal{H}$ has sufficient expressive power, then $d(h_\infty, c)$ will be small (precise
learnability would actually require it to be 0). If $\mathcal{H}$ is a small class, then $d(c, h_\infty)$
might be large for some $c \in C$ and even in the case of infinite data, poor generalization
will result. This is thus a function of the complexity of the model class $\mathcal{H}$ and how
well matched it is to $C$, a matter discussed earlier as well.

Having established that $h_\infty$ is the best hypothesis the learner can possibly pos-
tulate, it is consequently of interest to be able to characterize the convergence of the
learner's hypothesis $h_m$ to this best hypothesis as the number of data, $m$, goes to
infinity. The number of examples the learner needs to see before it can choose with
high confidence a hypothesis close enough to the best will be our notion of informa-
tional complexity. A crucial observation we would like to make is that the number
of examples depends (among other things, and we will discuss this soon) upon the
size of the class $\mathcal{H}$. To intuitively appreciate this, consider the pathological case of $\mathcal{H}$
consisting of just one hypothesis. In that case, $h_m \in \mathcal{H}$ is always equal to $h_\infty \in \mathcal{H}$
and the learner needs to see no data at all. Of course, the expressive power of such
a class $\mathcal{H}$ would be extremely limited. If on the other hand, the class $\mathcal{H}$ is very com-
plex and for a finite data set has a large number of competing hypotheses which fit
the data but extend in very different ways to the complete space, then considerably
more data would be needed to disambiguate between these hypotheses. For certain
probabilistic models (where function learning is essentially equivalent to statistical
regression) Vapnik and Chervonenkis studied this problem closely and developed the
notion of VC-dimension: a combinatorial measure of the complexity of the class $\mathcal{H}$
which is related to its sample complexity (see also Blumer et al. (1986) for applications
to computational learning theory). 
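
One standard way to see how the size of the hypothesis class enters the number of examples is the textbook bound for a finite class. It is offered here only as an illustration under assumptions that are not part of the discussion above: the class $\mathcal{H}$ is finite, the target $c$ belongs to $\mathcal{H}$, the $m$ examples are drawn independently and without noise, $d(h, c)$ is the probability that $h$ and $c$ disagree on a random example, and the learner outputs any hypothesis $h_m$ consistent with the data. A union bound over the hypotheses whose error exceeds $\epsilon$ then gives:

```latex
\Pr\bigl[\, d(h_m, c) > \epsilon \,\bigr]
  \;\le\; |\mathcal{H}| \, (1 - \epsilon)^m
  \;\le\; |\mathcal{H}| \, e^{-\epsilon m},
\qquad \text{so} \qquad
m \;\ge\; \frac{1}{\epsilon} \left( \ln |\mathcal{H}| + \ln \frac{1}{\delta} \right)
\;\Longrightarrow\;
\Pr\bigl[\, d(h_m, c) > \epsilon \,\bigr] \;\le\; \delta .
```

Under these assumptions the sufficient number of examples grows with $\ln |\mathcal{H}|$: a richer class demands more data, in line with the qualitative argument above. For infinite classes, the VC-dimension plays the role that $\ln |\mathcal{H}|$ plays here.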

Thus broadly speaking, the more constrained the hypothesis class $\mathcal{H}$, the smaller
is the sample complexity (i.e. the easier it is to choose from finite experience the
best hypothesis) but then again, the poorer is the expressive power and consequently
even $h_\infty$ might be far away from the reality $c$. On the other hand, increasing the
expressive power of $\mathcal{H}$ might decrease $d(h_\infty, c)$ but increase the sample complexity.
There is thus an inherent tension between the complexity of $\mathcal{H}$ and the number of
examples; finding the class $\mathcal{H}$ of the right complexity is the challenge of science. Part
of the understanding of biological phenomena involves deciding where on the tightrope 
between extremely complex and extremely simple models the true phenomena lie. In 
this respect, informational complexity is a powerful tool to help discriminate between 
models of different complexities to describe natural phenomena. 
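
The tension described above can be written down in one line. As a sketch, assume only that $d$ is a genuine metric (so that the triangle inequality holds), write $h_m$ for the learner's hypothesis after $m$ examples, and write $h_\infty$ for the best hypothesis in the class; the labels under the two terms are descriptive names chosen for this illustration.

```latex
d(h_m, c)
  \;\le\;
  \underbrace{d(h_m, h_\infty)}_{\text{finite number of examples}}
  \;+\;
  \underbrace{d(h_\infty, c)}_{\text{limited expressive power of } \mathcal{H}}
```

Enlarging $\mathcal{H}$ can only shrink the second term, since the minimum is taken over a larger set, but it typically makes the first term harder to control for a fixed number of examples. Restricting $\mathcal{H}$ has the opposite effect.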

One sterling example where this information-complexity approach has startlingly 
revised the kinds of models used can be found in the Chomskyan revolution in linguis- 
tics. Humans develop a mature knowledge of language which is both rich and subtle 
on the basis of example sentences spoken to them by parents and guardians during 
childhood. On observing the child language acquisition process, it is remarkable how
few examples children need to be able to generalize in very sophisticated ways. Further
it is observed that children generalize in roughly the same way; too striking a coin- 
cidence to be attributed purely to chance. Languages are infinite sets of sentences; 
yet on the basis of exposure to finite linguistic experience (sentences) children gen- 
eralize to the infinite set. If it were the case that children operated with completely 
unconstrained hypotheses about languages, i.e., if they were willing to consider all 
possible infinite extensions to the finite data set they had, then they would never be 
able to generalize correctly or generalize in the same manner. They receive far too
few examples for that. This "poverty of stimulus" in the child language acquisition 
process motivated Chomsky to suggest that children operate with hypotheses about 
language which are constrained in some fashion. In other words, we are genetically
predisposed as human beings to choose certain generalizations and not others; we
operate with a set of restricted hypotheses. The goal of linguistics then shifted to
finding the class $\mathcal{H}$ with the right complexity; something which had large enough
expressive power to capture the natural languages, yet low enough complexity to be
learned by children. In this thesis we spend some time on models for learning languages.

Figure 1-1: The space of possibilities. The various factors which affect the informa-
tional complexity of learning from examples.

Thus we see that an investigation of the informational complexity of learning 
has implications for model building; something which is at the core of the scientific 
enterprise. Particularly when studying cognitive behavior, it might potentially allow 
us to choose the right complexity, i.e., how much processing is already built into the 
brain (the analog of Hubel and Wiesel's orientation-specific neurons or Chomsky's 
universal grammar) and how much is acquired by exposure to the environment. At 
this point, it would be worthwhile to point out that the complexity of $\mathcal{H}$ is only one
of the factors influencing the informational complexity. Recall that we have already
sharpened our notion of informational complexity to mean the number of examples
needed by the learner so that $d(h_m, h_\infty)$ is small. There are several factors which
could in principle affect it and Figure 1-1 shows them as decomposed along several
different dimensions in the space of possibilities. 

Clearly, informational complexity might depend upon the manner in which
examples are obtained. If one were learning to discriminate between the sounds
/s/ and /sh/, for example, one could potentially learn more effectively if one were 
presented with examples drawn from near the decision boundary, i.e., examples of 
sounds which were likely to be confused. Such a presentation might conceivably 
help the learner acquire a sharper idea of the distinction between the two sounds 
rather than if it were simply presented with canonical examples of each phoneme. Of 
course, it might well be the case that our intuition is false in this case, but we will 
never know unless the issue is formally addressed. In similar fashion, the presence 
and nature of the noise corrupting the examples could affect sample complexity. In 
the case of s/sh classification, a lot of noise in high frequency bands of the signal 
could affect our perception of frication and might delay learning; on the other hand 
noise which only affects volume of the signal might have less effect. The algorithm 
used to compute a best hypothesis $h_m$ from the data might affect both learnability
and sample complexity. A muddle-headed, poorly motivated algorithm might choose
hypotheses at random or it might choose hypotheses according to some criterion which
has nothing to do with the metric $d$ by which success is to be measured. In such cases,
it is possible that $h_m$ might not converge to $h_\infty$ at all, or it might take a very long
time. Finally the metric d according to which success is to be measured is clearly a 
factor. 
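
The intuition about examples near the decision boundary can be tested in a toy setting. The sketch below is only an illustration under assumptions not drawn from the thesis: a noise-free threshold concept on [0, 1], a passive learner receiving uniformly drawn examples, and an active learner that always queries the midpoint of its current interval of uncertainty (the names `passive_interval` and `active_interval` are hypothetical). In this setting the active learner halves its uncertainty with each query, whereas the passive learner's uncertainty shrinks on average only roughly in proportion to $1/m$.

```python
import random


def label(x, target):
    """Noise-free oracle for the threshold concept c(x) = 1 iff x >= target."""
    return 1 if x >= target else 0


def passive_interval(target, m, rng):
    """Interval known to contain the threshold after m uniformly drawn examples."""
    lo, hi = 0.0, 1.0
    for _ in range(m):
        x = rng.random()
        if label(x, target) == 1:
            hi = min(hi, x)  # a positive example bounds the threshold from above
        else:
            lo = max(lo, x)  # a negative example bounds it from below
    return lo, hi


def active_interval(target, m):
    """Interval known to contain the threshold after m midpoint queries."""
    lo, hi = 0.0, 1.0
    for _ in range(m):
        x = (lo + hi) / 2.0  # query the point of greatest uncertainty
        if label(x, target) == 1:
            hi = x
        else:
            lo = x
    return lo, hi


if __name__ == "__main__":
    rng = random.Random(0)
    p_lo, p_hi = passive_interval(0.3, 10, rng)
    a_lo, a_hi = active_interval(0.3, 10)
    print("passive width:", p_hi - p_lo)
    print("active width: ", a_hi - a_lo)
```

After $m$ midpoint queries the active interval has width $2^{-m}$ regardless of the target. The passive width depends on where the random examples happen to fall, which is exactly the sense in which a passive learner is at the mercy of the distribution.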

These different factors interact with each other; our central goal in this thesis is 
to explore this possibility-space at many different points. We will return to this space 
and our points of exploration later. It is our hope that after seeing the interaction 
between the different dimensions and their relation to informational complexity, our 
intuitions about the analysis of learning paradigms will be sharpened. 

1.2 Parametric Hypothesis Spaces 

We have already introduced the notion of hypotheses and hypothesis classes from 
which these hypotheses are chosen. We have also remarked that the number of ex- 
amples needed to choose a "best" hypothesis (or at any rate, one close enough to 
the best according to our distance metric) depends inherently upon the complexity 
of these classes. Another related question of some interest is: how do we represent 
these hypotheses? One approach pervasive in science is to capture the degree of 
variability amongst the hypotheses in a parametric fashion. The greater the flexibility 
of the parameterization, the greater the allowed variability and the fewer the inbuilt 
constraints, i.e., the larger the domain and consequently the larger the search space. 
One can consider several other examples from the sciences where parametric models 
have been developed for some task or other. 

In our thesis, we spend a considerable amount of time and energy on two paramet- 
ric models which are remarkably different in their structural properties and analyze 
issues of informational complexity in each. It is worthwhile perhaps to say a few 
words about each. 

Neural Networks 

Feed-forward "neural networks" (Lippmann, 1987) are becoming increasingly popular 
in science and engineering as a modelling technique. We consider a class of 
feed-forward networks known as Gaussian regularization networks (Poggio and Girosi, 
1990). Essentially, such a network performs a mapping from $\mathbb{R}^k$ to $\mathbb{R}$ given by the 
following expression 

$$f(x) = \sum_{i=1}^{n} c_i \, G\!\left(\frac{\|x - t_i\|}{\sigma_i}\right)$$

Fig. 1-2 shows a diagrammatic representation of the network (it is particularly 
popular in the neural net communities to show diagrams of the architecture and we 
see no need to break with tradition here). The $c_i$'s are real-valued, G is a Gaussian 
function (the activation function), the $t_i$'s are the centers, and the $\sigma_i$'s are the 
spreads of the Gaussian functions. 
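
To make the functional form concrete, the following is a minimal sketch of how such a network could be evaluated. It assumes the network has the form $f(x) = \sum_i c_i G(\|x - t_i\| / \sigma_i)$ with $G(r) = e^{-r^2}$; the normalization of G, the function names and the two-unit example are illustrative assumptions, not taken from the thesis.

```python
import math

def gaussian(r):
    """Gaussian activation G(r) = exp(-r^2); this normalization is an assumption."""
    return math.exp(-r * r)

def network(x, coeffs, centers, spreads):
    """Evaluate f(x) = sum_i c_i * G(||x - t_i|| / sigma_i) at a point x in R^k."""
    total = 0.0
    for c, t, sigma in zip(coeffs, centers, spreads):
        dist = math.sqrt(sum((xj - tj) ** 2 for xj, tj in zip(x, t)))
        total += c * gaussian(dist / sigma)
    return total

# Hypothetical two-unit network on R^2.
coeffs = [1.0, -0.5]
centers = [(0.0, 0.0), (1.0, 1.0)]
spreads = [1.0, 2.0]
value_at_first_center = network((0.0, 0.0), coeffs, centers, spreads)
```

At the first center the first unit contributes its full coefficient, while the second unit contributes a value damped by its distance and spread.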

Clearly then, one can consider $H_n$ to be the class of all functions which can be 
represented in the form above. This class would consist of functions parameterized by 
3n parameters, corresponding to the free variables $c_i$, $t_i$, and $\sigma_i$. One can make several 
alterations to the architecture, changing for example the number of layers, changing 
the activation functions, putting constraints on the weights and so on, thereby arriving 
at different kinds of parameterized families, e.g., the multilayer perceptrons with 
sigmoidal units, hierarchical mixtures of experts (Jacobs et al., 1991), etc. Such 
feed-forward networks have been used for tasks as diverse as discriminating between virgin 
and non-virgin olive oil, speech recognition, predicting the stock market, robotic 
control and so forth. Given the prevalence of such neural networks, we have chosen in 
this thesis to investigate issues pertaining to the informational complexity of networks. 

Natural Languages 

Natural languages can be described by their grammars which are essentially functional 
mappings from strings to the set {0, 1}. According to conventional notation, there is 
an alphabet set $\Sigma$ which is a finite set of symbols. In the case of a particular natural 
language, like English, for example, this set is the vocabulary: a finite set of words. 
These symbols or words are the basic building blocks of sentences which are just 
strings of words. $\Sigma^*$ denotes the set of all finite sentences and a language L is a 
subset of $\Sigma^*$, i.e., some collection of sentences which belong to the language. For 
example, in English, "I eat bananas" is a sentence (an element of $\Sigma^*$), being as it is a 
string of the three words (elements of $\Sigma$) "I", "eat", and "bananas". Further, this sentence 
belongs to the set of valid English sentences. On the other hand, the sentence 
"I bananas eat", though a member of $\Sigma^*$, is not a member of the set of valid English 
sentences. 

The grammar $G_L$ associated with the language L then is a functional description 
of the mapping from $\Sigma^*$ to {0, 1}: all sentences belonging to $\Sigma^*$ which belong to L 
are mapped onto 1 by $G_L$, the rest are assigned to 0. According to current theories of 
linguistics which we will consider in this thesis, it is profitable for analysis to let the 
set $\Sigma$ consist of syntactic categories like verbs, adverbs, prepositions, nouns, and so 
on. A sentence could now be considered to be a string of such syntactic categories; 
each category then maps onto words of the vocabulary. Thus the string of syntactic 
categories Noun Verb Noun maps onto "I eat bananas"; the string Noun Noun Verb 
maps onto "I bananas eat". A grammar is a system of rules and principles 
which picks out some strings of syntactic categories as valid, others as not. Most of 
linguistic theory concentrates on generative grammars: grammars which are able to 
build the valid sentences out of the syntactic components according to certain rules. 
Phrase structure grammars build sentences out of phrases, and phrases out of other 
phrases or syntactic categories. 
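
The view of a grammar as a mapping from strings to {0, 1} can be sketched in a few lines of Python. The lexicon, the single valid category pattern and the function name `grammar` below are hypothetical toy choices made only for illustration; a real grammar would of course license far more than one pattern.

```python
# Hypothetical toy grammar: the only valid category string is Noun Verb Noun.
VALID_PATTERNS = {("Noun", "Verb", "Noun")}

# Hypothetical lexicon assigning each word a syntactic category.
LEXICON = {"I": "Noun", "bananas": "Noun", "eat": "Verb"}

def grammar(sentence):
    """Return 1 if the sentence's category string is valid, else 0."""
    words = sentence.split()
    if any(w not in LEXICON for w in words):
        return 0
    categories = tuple(LEXICON[w] for w in words)
    return 1 if categories in VALID_PATTERNS else 0

accepted = grammar("I eat bananas")   # category string: Noun Verb Noun
rejected = grammar("I bananas eat")   # category string: Noun Noun Verb
```

Both strings are members of the set of all finite word strings, but only the first is mapped onto 1.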

Over the last decade, a parametric theory of grammars (Chomsky, 1981) has 
begun to evolve. According to this, a grammar $G(p_1, \ldots, p_n)$ is parameterized by a 
finite (in this case, n) number of parameters $p_1$ through $p_n$. If these parameters are 
set to one set of values, one would obtain the grammar of a specific language, say, 
German. Setting them to another set of values would define the grammar of another 
language, say English. To get a feel for what parameters are like, consider an example 
from X-bar theory, a subcomponent of grammars. According to X-bar theory, the 
structure of an XP or X-phrase (where X could stand for adjective, noun, verb, etc.) 
is given by the following context-free production rules which are parameterized by 
two parameters $p_1$ and $p_2$. 

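$$
\begin{aligned}
XP &\rightarrow \mathrm{Spec}\ X' \ (p_1 = 0) \quad \text{or} \quad X'\ \mathrm{Spec} \ (p_1 = 1) \\
X' &\rightarrow \mathrm{Comp}\ X' \ (p_2 = 0) \quad \text{or} \quad X'\ \mathrm{Comp} \ (p_2 = 1) \\
X' &\rightarrow \mathrm{Comp}\ X \ (p_2 = 0) \quad \text{or} \quad X\ \mathrm{Comp} \ (p_2 = 1) \\
\mathrm{Comp} &\rightarrow YP
\end{aligned}
$$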

For example, English is a comp-final language ($p_2 = 1$) while Bengali is a 
comp-first language ($p_2 = 0$). Notice how all the phrases (irrespective of whether it is a noun 
phrase, verb phrase etc.) in English have their complement at the end, while Bengali 
is the exact reverse. This is one example of a parameterized difference between the 
two languages. 

Figures 1-4 and 1-5 show the tree diagrams corresponding to 
the sentence "with one hand" in English and Bengali. English is spec-first and 
comp-final (i.e., $p_1 = 0$ and $p_2 = 1$); Bengali on the other hand is spec-first and comp-first 
($p_1 = 0$ and $p_2 = 0$). 
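
The effect of the two parameters on word order can be illustrated with a small sketch that linearizes a phrase under a given setting of $p_1$ and $p_2$. The tuple encoding (specifier, head, complement), the function name `linearize` and the analysis of "with one hand" used here are simplifying assumptions for illustration; English glosses stand in for the Bengali words.

```python
def linearize(phrase, p1, p2):
    """Flatten a phrase (spec, head, comp) into a list of words.

    p1 = 0 puts the specifier first; p2 = 0 puts the complement first.
    spec and comp are each None, a single word, or a nested phrase tuple.
    """
    def words(node):
        if node is None:
            return []
        if isinstance(node, str):
            return [node]
        return linearize(node, p1, p2)

    spec, head, comp = phrase
    # X' level: order head and complement according to p2.
    bar = words(comp) + [head] if p2 == 0 else [head] + words(comp)
    # XP level: order specifier and X' according to p1.
    return words(spec) + bar if p1 == 0 else bar + words(spec)

# Hypothetical analysis: a PP headed by "with" whose complement is an NP
# with specifier "one" and head "hand".
pp = (None, "with", ("one", "hand", None))
english_order = " ".join(linearize(pp, p1=0, p2=1))
comp_first_order = " ".join(linearize(pp, p1=0, p2=0))
```

With the spec-first, comp-final setting the head precedes its complement; flipping only $p_2$ moves the head to the end of the phrase.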

1.3 The Thesis: Technical Contents and Major 
Contributions 

So far we have discussed in very general terms, the various components of a learning 
paradigm and their relationship to each other. We have stated our intention of ana- 
lyzing the informational complexity of learning from examples; we have thus defined 
for ourselves the possibility space of Figure 1.1 that needs to be explored. In this 
thesis, we look at a few specific points in this space; in doing so, the issues involved 
in informational complexity can be precisely formalized and sharper results obtained. 
Chapters 2 and 3 of this thesis are completely self-contained. Chapters 4 and 5 should 
be read as a unit; together they form another stand-alone part of this thesis. 

Figure 1-3: Parametric difference in phrase structure between English and Bengali 
on the basis of the parameter $p_2$. 

Figure 1-4: Analysis of the English sentence "with one hand" according to its 
parameterized X-bar grammar. 

Figure 1-5: Analysis of the Bengali sentence corresponding to "with one hand" according 
to its parameterized X-bar grammar. Notice the difference in word order. 

Chapter 2 of this thesis examines the use of neural networks of a certain kind 
(the so called regularization networks) in solving pattern classification and regression 
problems. This corresponds to a point in the space of Figure 1.1 where the concept 
class is a Sobolev space of functions, the hypothesis class is the class of all feed-forward 
regularization networks (with certain restrictions on their weights), the examples are 
drawn according to a fixed, unknown, arbitrary probability distribution, the distance 
metric is an $L^2(P)$ norm on the space of functions, and the best hypothesis is chosen 
by training a finite-sized network on labelled examples according 
to a least-squares criterion. The concept class is infinite-dimensional; on using a finite 
network and a finite amount of data, a certain amount of generalization error is made. 
We observe that the generalization error can be decomposed into an approximation 
error due to the finite number of parameters of the network and an estimation error 
due to the finite number of data points. Using techniques from approximation theory 
and VC theory, we obtain a bound on the generalization error in terms of the number 
of parameters and number of examples. Our main contributions in this chapter 
include: 

• Formulation of the trade-off between hypothesis complexity and sample com- 
plexity when using Gaussian regularization networks. 

• Combining results from approximation theory and the theory of empirical pro- 
cesses to obtain a specific bound on the total generalization error as a function 
of the number of examples and number of parameters. 

• Using the bound above to provide guidelines for choosing an optimal network 
architecture to solve certain regression problems. 

Chapter 3 explores the issue of active learning. We are specifically interested in 
investigating whether allowing the learner to choose examples helps in learning with 
fewer examples. This chapter consists of two parts which include several forays into 
this question. The first part explores this issue in a function approximation setting. 
It is not immediately clear that even if the learner were allowed to choose his/her 
own examples, there exist principled ways of doing this. We develop a framework 
within which meaningful adaptive sampling strategies can be obtained for arbitrary 
function classes. As specific examples we consider cases where the concept classes 
are real-valued classes like monotonic functions and functions with bounded first 
derivative, hypothesis classes are spline functions, there is no noise, the learner chooses 
an interpolating spline as a best hypothesis and examples are obtained passively 
(by random draw) or adaptively (according to our strategy) by the active learner. We 
obtain theoretical and empirical bounds on the sample complexity and generalization 
error for this task. In the second part, we discuss the idea of epsilon-focusing; a 
strategy whereby the learner can adaptively focus on smaller and smaller regions of 
the domain to solve certain pattern classification problems. We derive conditions 
on function classes where epsilon-focusing would result in faster learning. Our main 
contributions here include: 

• A formulation of active learning in approximation theoretic terms as an adaptive 
approximation problem. 

• Development of active strategies for learning classes of real valued functions. 
These active strategies differ from traditional adaptive approximation strategies 
in optimal sampling theory in that examples are adaptively selected on the basis 
of previous examples as opposed to preselected on the basis of knowledge about 
the concept class. 

• Explicit computation of theoretical upper and lower bounds on the sample 
complexity of PAC learning real classes using passive and active strategies. 
Simulations with some test target functions allow us to compare the empirical 
performance against the theoretical worst case bounds. 

• Introduction of the idea of epsilon-focusing which provides a theoretical mo- 
tivation for pattern classification schemes where more data is collected near 
the estimated class boundary. The computation of explicit sample complexity 
bounds for algorithms motivated by epsilon-focusing. 
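
As a rough illustration of what adaptive example selection can look like for monotonic functions, consider the following sketch. It is one plausible greedy strategy and is not claimed to be the algorithm developed in Chapter 3: between two consecutive samples a monotone function is confined to a rectangle, so the learner queries the midpoint of the interval whose rectangle is largest. An equally spaced design stands in for the non-adaptive learner (the thesis considers random draws), and the target $x^{10}$ is a hypothetical choice.

```python
def uncertainty(points):
    """Sum of rectangle areas between consecutive samples of a monotone function."""
    return sum((x1 - x0) * (y1 - y0)
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

def active_sample(f, queries):
    """Greedy adaptive sampling on [0, 1]: bisect the most uncertain interval."""
    points = [(0.0, f(0.0)), (1.0, f(1.0))]
    for _ in range(queries):
        # Find the interval whose bounding rectangle has the largest area.
        i = max(range(len(points) - 1),
                key=lambda j: (points[j + 1][0] - points[j][0])
                * (points[j + 1][1] - points[j][1]))
        mid = 0.5 * (points[i][0] + points[i + 1][0])
        points.insert(i + 1, (mid, f(mid)))
    return points

def fixed_sample(f, queries):
    """Non-adaptive baseline: equally spaced points, endpoints included."""
    xs = [k / (queries + 1) for k in range(queries + 2)]
    return [(x, f(x)) for x in xs]

target = lambda x: x ** 10  # hypothetical monotone target on [0, 1]
active_area = uncertainty(active_sample(target, 7))
fixed_area = uncertainty(fixed_sample(target, 7))
```

The total rectangle area bounds the $L^1$ distance between any two monotone functions consistent with the samples, so a smaller area means the data pin the target down more tightly. For this steep target the adaptive design concentrates its queries where the function changes quickly.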

Chapters 4 and 5 of this thesis concentrate on a very different region of the 
possibility space of Figure 1.1. Here the concept class is a restricted subclass of 
natural languages, the hypothesis class consists of parameterized grammars including 
X-bar theory, verb movement and case theory, examples are assumed to be drawn 
according to some distribution on the sentences of the target, there might or might 
not be noise, there is a discrete distance metric which requires exact identification of 
the target, the algorithm used to choose the best hypothesis is the Triggering Learning 
Algorithm (Gibson and Wexler, 1993). 

The TLA was proposed recently by Gibson and Wexler as a possible mechanism 
by which children set parameters and learn the language to which they are 
exposed. Chapter 4 originated as an attempt to analyze the TLA from the perspective 
of informational complexity and to derive conditions for convergence and rates of 
convergence of the TLA to the target. We explore the TLA and its variants under 
the diverse influences of noise and distributional assumptions on the data, and explore the 
linguistic consequences of this. In Chapter 5, we study another important facet of the 
language learning puzzle. Starting with a set of grammars and a learning algorithm, 
we are able to derive a dynamical system whose states correspond to the linguistic 
composition of the population, i.e., the relative percentage of people in a community 
speaking a particular language. For the TLA, we give the precise update rules for the 
states of this system, analyze conditions for stability and carry out several 
simulations in linguistically plausible systems. This serves as a formal model for describing 
the historical evolution of languages and formalizes ideas inherent in Lightfoot (1991) 
and Hawkins and Gell-Mann (1989) for the first time. These two chapters make 
several important contributions including: 

• The development of a mathematical framework (a Markov structure) to formally 
study the issues relating to the learnability and sample complexity of the TLA. 

• The investigation of variants of TLA, the effect of noise, distributional assump- 
tions and parameterization of the space in a systematic manner on linguistically 
natural spaces. 

• The derivation of algorithm-independent bounds on the sample complexity us- 
ing results from computational learning theory. 

• The derivation of a linguistic dynamical system starting from the TLA operating 
on parameterized grammars. 

• Utilizing the dynamical system as a model for language change, running sim- 
ulations on linguistically natural spaces and comparison of the results against 
historically observed patterns. 

• Introduction of the diachronic criterion for deciding the plausibility of any learn- 
ing algorithm. 
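
The Markov structure mentioned in the first contribution can be sketched as follows. The code assumes the usual description of the TLA: if the current grammar parses the incoming sentence the learner stays put; otherwise it flips one parameter chosen uniformly at random and keeps the flip only if the new grammar parses the sentence. The two-parameter space, the abstract sentence labels and the uniform distribution over target sentences are hypothetical choices for illustration.

```python
from fractions import Fraction
from itertools import product

def tla_transition_matrix(languages, target, n_params):
    """Transition probabilities of a TLA-style learner on binary parameters.

    languages maps each parameter tuple to its set of sentences; examples are
    drawn uniformly from the target language (an assumption of this sketch).
    """
    states = list(product((0, 1), repeat=n_params))
    sentences = sorted(languages[target])
    P = {s: {t: Fraction(0) for t in states} for s in states}
    for s in states:
        for w in sentences:
            p_w = Fraction(1, len(sentences))
            if w in languages[s]:
                P[s][s] += p_w  # current grammar parses w: no change
                continue
            for i in range(n_params):
                # Flip one parameter chosen uniformly at random.
                t = s[:i] + (1 - s[i],) + s[i + 1:]
                # Keep the flip only if the new grammar parses w.
                nxt = t if w in languages[t] else s
                P[s][nxt] += p_w * Fraction(1, n_params)
    return P

# Hypothetical two-parameter space with abstract sentence labels.
languages = {
    (0, 0): {"s1", "s2"},
    (0, 1): {"s2", "s3"},
    (1, 0): {"s1", "s4"},
    (1, 1): {"s3", "s4", "s5"},
}
P = tla_transition_matrix(languages, target=(1, 1), n_params=2)
```

Each grammar is a state of the chain and the target grammar is absorbing, since every sentence it receives is parsable. Questions of learnability then become questions about whether the chain reaches the absorbing state from every starting state, and how quickly.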

1.3.1 A Final Word 

Over the last decade, there has been an explosion of interest in formal learning theory 
(see the Proceedings of ACM COLT for a whiff of this). This has brought in its wake a 
perspective on learning paradigms which we greatly share and this thesis reflects that 
perspective strongly. In addition, as with all interdisciplinary pieces of work, we have 
an intellectual debt to many different fields. The areas of approximation theory and 
statistics, particularly the part of empirical process theory beautifully worked out by 
Vapnik and Chervonenkis, model selection, pattern recognition, decision theory, and 
nonparametric regression play an important role in Chapter 2. Ideas from adaptive 
integration and numerical analysis play an important role in Chapter 3. Chapters 
4 and 5 have evolved from the application of our computational perspective to the 
analysis of learning paradigms which are considered worthwhile in linguistic theory 
(our decision of what is linguistically worthwhile has been influenced greatly by schol- 
arly works in the Chomskyan tradition). Here, there is some use of Markov chain 
theory and dynamical systems theory. In all of this, we have brought to bear well 
known results and techniques from different areas of mathematics to formally pose 
and answer questions of interest in human and machine learning; questions previously 
unposed or unanswered or both. In this strict sense, there is little new mathematics 
here; though an abundant demonstration of its usefulness as a research tool in the 
cognitive and computer sciences. This reflects our purpose and our intended audi- 
ence for this thesis, namely, all people interested in human or machine learning from 
a computational perspective. 
Chapter 2 

On the Relationship Between Generalization Error, Hypothesis Complexity, and 
Sample Complexity in Radial Basis Functions 

Abstract 

Feedforward networks are a class of approximation techniques that can be used to learn to perform 
some tasks from a finite set of examples. The question of the capability of a network to generalize 
from a finite training set to unseen data is clearly of crucial importance. In this chapter, we bound the 
generalization error of a class of Radial Basis Functions, for certain well defined function learning 
tasks, in terms of the number of parameters and number of examples. We show that the total 
generalization error is partly due to the insufficient representational capacity of the network (because 
of the finite size of the network being used) and partly due to insufficient information about the 
target function because of the finite number of samples. Prior research has looked at representational 
capacity or sample complexity in isolation. In the spirit of A. Barron, H. White and S. Geman we 
develop a framework to look at both. While the bound that we derive is specific for Radial Basis 
Functions, a number of observations deriving from it apply to any approximation technique. Our 
result also sheds light on ways to choose an appropriate network architecture for a particular problem 
and the kinds of problems which can be effectively solved with finite resources, i.e., with finite number 
of parameters and finite amounts of data. 

2.1 Introduction 

Many problems in learning theory can be effectively modelled as learning an input 
output mapping on the basis of limited evidence of what this mapping might be. 
The mapping usually takes the form of some unknown function between two spaces 
and the evidence is often a set of labelled, noisy examples, i.e., (x, y) pairs which are 
consistent with this function. On the basis of this data set, the learner tries to infer 
the true function. 

We have discussed in Chapter 1 several examples from speech recognition, object 
recognition, and finance where such a scenario exists. At the risk of belaboring this 
point, consider two more examples which illustrate this approach. In economics, it is 
sometimes of interest to predict the future foreign currency rates on the basis of the 
past time series. There might be a function which captures the dynamical relation 
between past and future currency rates and one typically tries to uncover this relation 
from data which has been appropriately processed. Similarly in medicine, one might 
be interested in predicting whether or not breast cancer will recur in a patient within 
five years after her treatment. The input space might involve dimensions like the age 
of the patient, whether she has been through menopause, the radiation treatment 
previously used etc. The output space would be single dimensional boolean taking on 
values depending upon whether breast cancer recurs or not. One might collect data 
from case histories of patients and try to uncover the underlying function. 

The unknown target function is assumed to belong to some class $\mathcal{F}$ which, using 
the terminology of computational learning theory, we call the concept class. 
Typical examples of concept classes are classes of indicator functions, boolean functions, 
Sobolev spaces etc. The learner is provided with a finite data set. One can make many 
assumptions about how this data set is collected but a common assumption which 
would suffice for our purposes is that the data is drawn by sampling independently 
the input-output space ($X \times Y$) according to some unknown probability distribution. 
On the basis of this data, the learner then develops a hypothesis (another function) 
about the identity of the target function, i.e., it comes up with a function chosen from 
some class, say H (the hypothesis class), which best fits the data and postulates this to 
be the target. Hypothesis classes could also be of different kinds. For example, they 
could be classes of boolean functions, polynomials, linear functions, spline functions 
and so on. One such class which is being increasingly used for learning problems is 
the class of feedforward networks (Lippmann, 1987; Hertz, Krogh, and Palmer, 1991; 
Girosi, Jones, and Poggio, 1993). A typical feedforward network is a parameterized 
function of the form 

$$f(x) = \sum_{i=1}^{n} c_i H(x; w_i)$$

where $\{c_i\}_{i=1}^{n}$ and $\{w_i\}_{i=1}^{n}$ are free parameters and $H(\cdot\,; \cdot)$ is a given, fixed function 
(the "activation function"). Depending on the choice of the activation function one 
gets different network models, such as the most common form of "neural networks", 
the Multilayer Perceptron (Rumelhart, Hinton, and Williams, 1986; Cybenko, 1989; 
Lapedes, and Farmer, 1988; Hertz, Krogh, and Palmer, 1991; Hornik, Stinchcombe, 
and White, 1989; Funahashi, 1989; Mhaskar, and Micchelli, 1992; Mhaskar, 1993; 
Irie, and Miyake, 1988), or the Radial Basis Functions network (Broomhead, and 
Lowe, 1988; Dyn, 1987; Hardy, 1971, 1990; Micchelli, 1986; Powell, 1990; Moody, 
and Darken, 1989; Poggio, and Girosi, 1990; Girosi, 1992; Girosi, Jones, and Poggio, 
1993). 

If, as more and more data becomes available, the learner's hypothesis becomes 
closer and closer to the target and converges to it in the limit, the target is said to 
be learnable. The error between the learner's hypothesis and the target function is 
defined to be the generalization error and for the target to be learnable the gener- 
alization error should go to zero as the data goes to infinity. While learnability is 
certainly a very desirable quality, it requires the fulfillment of two important criteria. 

First, there is the issue of the representational capacity (or hypothesis complexity) 
of the hypothesis class. This must have sufficient power to represent or closely 
approximate the concept class. Otherwise for some target function f, the best hypothesis h 
in H might be far away from it. The error that this best hypothesis makes is formal- 
ized later as the approximation error. In this case, all the learner can hope to do is 
to converge to h in the limit of infinite data and so it will never recover the target. 
Second, we do not have infinite data but only some finite random sample set from 
which we construct a hypothesis. This hypothesis constructed from the finite data 
might be far from the best possible hypothesis, h, resulting in a further error. This 
additional error (caused by finiteness of data) is formalized later as the estimation 
error. The amount of data needed to ensure a small estimation error is referred to as 
the sample complexity of the problem. The hypothesis complexity, the sample com- 
plexity and the generalization error are related. If the class H is very large or in other 
words has high complexity, then for the same estimation error, the sample complexity 
increases. If the hypothesis complexity is small, the sample complexity is also small 
but now for the same estimation error the approximation error is high. This point 
has been developed in terms of the Bias-Variance trade-off in (Geman, Bienenstock, 
and Doursat, 1992) in the context of neural networks, and others (Rissanen, 1983; 
Grenander, 1951; Vapnik, 1982; Stone, 1974) in statistics in general. 
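
The tension between the two sources of error can be seen in a small simulation. To keep the sketch dependency-free we swap in a much simpler hypothesis class than feedforward networks: piecewise-constant functions on n equal bins of [0, 1], fitted by least squares (bin means). The target $\sin(2\pi x)$, the noise level, the sample size, the uniform input distribution and all function names are assumptions made for illustration.

```python
import math
import random

def bin_means(xs, ys, n_bins):
    """Least-squares piecewise-constant fit: the mean of y in each bin."""
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for x, y in zip(xs, ys):
        b = min(int(x * n_bins), n_bins - 1)
        sums[b] += y
        counts[b] += 1
    # An empty bin falls back to 0.0 (arbitrary choice for this sketch).
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]

def errors(n_bins, n_examples, rng, grid=3200):
    """Return (approximation, estimation, total) squared L2 errors."""
    f0 = lambda x: math.sin(2 * math.pi * x)  # hypothetical target function
    gx = [(k + 0.5) / grid for k in range(grid)]
    # Best element of the class: the bin-wise average of f0 (uniform inputs).
    best = bin_means(gx, [f0(x) for x in gx], n_bins)
    xs = [rng.random() for _ in range(n_examples)]
    ys = [f0(x) + rng.gauss(0.0, 0.3) for x in xs]
    fit = bin_means(xs, ys, n_bins)
    idx = [min(int(x * n_bins), n_bins - 1) for x in gx]
    approx = sum((f0(x) - best[b]) ** 2 for x, b in zip(gx, idx)) / grid
    estim = sum((best[b] - fit[b]) ** 2 for b in idx) / grid
    total = sum((f0(x) - fit[b]) ** 2 for x, b in zip(gx, idx)) / grid
    return approx, estim, total

rng = random.Random(0)
results = {n: errors(n, 200, rng) for n in (2, 8, 32)}
```

For this class the total error splits exactly into the approximation and estimation parts, and the approximation part shrinks as n grows. With the number of examples held fixed, each bin is estimated from fewer points as n grows, which is the mechanism by which the estimation part tends to increase.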

The purpose of this chapter is two-fold. First, we formalize the problem of learning 
from examples so as to highlight the relationship between hypothesis complexity, 
sample complexity and total error. Second, we explore this relationship in the specific 
context of a particular hypothesis class. This is the class of Radial Basis Function 
networks which can be considered to belong to the broader class of feed-forward 
networks. Specifically, we are interested in asking the following questions about radial 
basis functions. 

Imagine you were interested in solving a particular problem (regression or pattern 
classification) using Radial Basis Function networks. Then, how large must the 
network be and how many examples do you need to draw so that you are guaranteed with 
high confidence to do very well? Conversely, if you had a finite network and a finite 
amount of data, what are the kinds of problems you could solve effectively? 

Clearly, if one were using a network with a finite number of parameters, then its 
representational capacity would be limited and therefore even in the best case we 
would make an approximation error. Drawing upon results in approximation theory 
(Lorentz, 1986) several researchers (Cybenko, 1989; Hartman, Keeler, and Kowalski, 
1989; Barron, 1991; Hornik, Stinchcombe, and White, 1989; Chui, and Li, 1990; Arai, 
1989; Mhaskar, and Micchelli, 1992; Mhaskar, 1993; Irie, and Miyake, 1988; Chen, 
Chen, and Liu, 1990) have investigated the approximating power of feedforward net- 
works showing how as the number of parameters goes to infinity, the network can 
approximate any continuous function. These results assume infinite data and 
questions of learnability from finite data are ignored. For a finite network, due to finiteness 
of the data, we make an error in estimating the parameters and consequently have an 
estimation error in addition to the approximation error mentioned earlier. Using re- 
sults from Vapnik and Chervonenkis (Vapnik, 1982; Vapnik, and Chervonenkis, 1971, 
1981, 1991) and Pollard (Pollard, 1984) , work has also been done (Haussler, 1989; 
Baum, and Haussler, 1988) on the sample complexity of finite networks showing how 
as the data goes to infinity, the estimation error goes to zero, i.e., the empirically 
optimized parameter settings converge to the optimal ones for that class. However, since 
the number of parameters is fixed and finite, even the optimal parameter setting 
might yield a function which is far from the target. This issue is left unexplored by 
Haussler (1989) in an excellent investigation of the sample complexity question. 

In this chapter, we explore the errors due to both finite parameters and finite 
data in a common setting. In order for the total generalization error to go to zero, 
both the number of parameters and the number of data have to go to infinity, and we 
provide rates at which they grow for learnability to result. Further, as a corollary, we 
are able to provide a principled way of choosing the optimal number of parameters 
so as to minimize expected errors. It should be mentioned here that White (1990) 
and Barron (1991) have provided excellent treatments of this problem for different 
hypothesis classes. We will mention their work at appropriate points in this chapter. 

The plan of the chapter is as follows: in section 2.2 we will formalize the problem 
and comment on issues of a general nature. We then provide in section 2.3 a precise 
statement of a specific problem. In section 2.4 we present our main result, whose 
proof is postponed to appendix 2-D for continuity of reading. The main result is 
qualified by several remarks in section 2.5. In section 2.6 we will discuss what could 
be the implications of our result in practice and finally we conclude in section 2.7 
with a reiteration of our essential points. 

2.2 Definitions and Statement of the Problem 

In order to make a precise statement of the problem we first need to introduce some 
terminology and to define a number of mathematical objects. A summary of the most 
common notations and definitions used in this chapter can be found in appendix 2-A. 

2.2.1 Random Variables and Probability Distributions 

Let X and Y be two arbitrary sets. We will call x and y the independent variable and 
response respectively, where x and y range over the generic elements of X and Y. In 
most cases X will be a subset of a d-dimensional Euclidean space and Y a subset of 
the real line, so that the independent variable will be a d-dimensional vector and the 
response a real number. We assume that a probability distribution P(x, y) is defined 
on $X \times Y$. P is unknown, although certain assumptions on it will be made later in 
this section. 

The probability distribution P(x, y) can also be written as$^4$: 

$$P(x, y) = P(x)\,P(y|x), \qquad (2.1)$$

where P(y|x) is the conditional probability of the response y given the independent 
variable x, and P(x) is the marginal probability of the independent variable given 
by: 

$$P(x) = \int_Y dy\, P(x, y).$$

Expected values with respect to P(x, y) or P(x) will always be indicated by $E[\cdot]$. 
Therefore, we will write: 

$$E[g(x, y)] = \int_{X \times Y} dx\, dy\, P(x, y)\, g(x, y)$$

and 

$$E[h(x)] = \int_X dx\, P(x)\, h(x)$$

for any arbitrary function g or h. 

$^4$ Note that we are assuming that the conditional distribution exists, but this is not a very 
restrictive assumption. 

2.2.2 Learning from Examples and Estimators 

The framework described above can be used to model the fact that in the real world we 
often have to deal with sets of variables that are related by a probabilistic relationship. 
For example, y could be the measured torque at a particular joint of a robot arm, 
and x the set of angular position, velocity and acceleration of the joints of the arm in 
a particular configuration. The relationship between x and y is probabilistic because 
there is noise affecting the measurement process, so that two different torques could 
be measured given the same configuration. 

In many cases we are provided with examples of this probabilistic relationship, 
that is with a data set $D_l$, obtained by sampling $l$ times the set $X \times Y$ according to 
P(x, y): 

$$D_l = \{(x_i, y_i) \in X \times Y\}_{i=1}^{l}.$$

From eq. (2.1) we see that we can think of an element $(x_i, y_i)$ of the data set $D_l$ as 
obtained by sampling X according to P(x), and then sampling Y according to P(y|x). 
In the robot arm example described above, it would mean that one could move the 
robot arm into a random configuration $x_1$, measure the corresponding torque $y_1$, and 
iterate this process $l$ times. 
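
The two-stage sampling suggested by eq. (2.1) can be sketched as follows. The uniform marginal, the Gaussian conditional with mean 2x, and the function name `sample_dataset` are hypothetical stand-ins, since the true distribution P(x, y) is unknown to the learner.

```python
import random

def sample_dataset(l, rng):
    """Draw l pairs (x_i, y_i): first x_i from P(x), then y_i from P(y|x_i)."""
    data = []
    for _ in range(l):
        x = rng.uniform(-1.0, 1.0)   # hypothetical marginal P(x)
        y = rng.gauss(2.0 * x, 0.1)  # hypothetical conditional P(y|x)
        data.append((x, y))
    return data

rng = random.Random(0)
D_l = sample_dataset(100, rng)
```

Because y is drawn from a distribution that depends on x, sampling the same x twice will in general return two different responses, just as two different torques could be measured for the same configuration.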

The interesting problem is, given an instance of x that does not appear in the 
data set $D_l$, to give an estimate of what we expect y to be. For example, given a 
certain configuration of the robot arm, we would like to estimate the corresponding 
torque. 

Formally, we define an estimator to be any function $f : X \rightarrow Y$. Clearly, since the 
independent variable x need not determine uniquely the response y, any estimator 
will make a certain amount of error. However, it is interesting to study the problem of 
finding the best possible estimator, given the knowledge of the data set $D_l$, and this 
problem will be defined as the problem of learning from examples, where the examples 
are represented by the data set $D_l$. Thus we have a probabilistic relation between x 
and y. One can think of this as an underlying deterministic relation corrupted with 
noise. Hopefully a good estimator will be able to recover this relation. 

2.2.3 The Expected Risk and the Regression Function 

In the previous section we explained the problem of learning from examples and stated 
that this is the same as the problem of finding the best estimator. To make sense of 
this statement, we now need to define a measure of how good an estimator is. Suppose 
we sample $X \times Y$ according to $P(x, y)$, obtaining the pair $(x, y)$. A measure (see 
footnote 5) of the error of the estimator $f$ at the point $x$ is: 

$(y - f(x))^2$ . 

In the example of the robot arm, $f(x)$ is our estimate of the torque corresponding to 
the configuration $x$, and $y$ is the measured torque of that configuration. The average 
error of the estimator $f$ is now given by the functional 

$I[f] = E[(y - f(x))^2] = \int_{X \times Y} dx\, dy\, P(x, y)\,(y - f(x))^2$ , 

which is usually called the expected risk of $f$ for the specific choice of the error measure. 

Given this particular measure as our yardstick to evaluate different estimators, 
we are now interested in finding the estimator that minimizes the expected risk. 
In order to proceed we need to specify the domain $\mathcal{F}$ over which the expected risk is 
minimized. Then, using the expected risk as a criterion, we could obtain the best element 
of $\mathcal{F}$. Depending on the properties of the unknown probability distribution $P(x, y)$ one 
could make different choices for $\mathcal{F}$. We will assume in the following that $\mathcal{F}$ is some 
space of differentiable functions. For example, $\mathcal{F}$ could be a space of functions with a 
certain number of bounded derivatives (the spaces $\Lambda^m(R^d)$ defined in appendix 2-A), or 
a Sobolev space of functions with a certain number of derivatives in $L_p$ (the spaces 
$H^{m,p}(R^d)$ defined in appendix 2-A). 

Assuming that the problem of minimizing $I[f]$ in $\mathcal{F}$ is well posed, it is easy to 
obtain its solution. In fact, the expected risk can be decomposed in the following way 
(see appendix 2-B): 

$I[f] = E[(f_0(x) - f(x))^2] + E[(y - f_0(x))^2]$    (2.2) 

where $f_0(x)$ is the so-called regression function, that is, the conditional mean of the 
response given the independent variable: 

$f_0(x) = \int_Y dy\, y\, P(y|x)$ .    (2.3) 

From eq. (2.2) it is clear that the regression function is the function that minimizes 
the expected risk in $\mathcal{F}$, and is therefore the best possible estimator. Hence, 

$f_0(x) = \arg\min_{f \in \mathcal{F}} I[f]$ . 

5 Note that this is the familiar squared error and, when averaged over its domain, yields the mean 
squared error for a particular estimator, a very common choice. However, it is useful to remember 
that there could be other choices as well. 

However, it is also clear that even the regression function will make an error equal 
to $E[(y - f_0(x))^2]$, that is, the variance of the response given a certain value for the 
independent variable, averaged over the values the independent variable can take. 
While the first term in eq. (2.2) depends on the choice of the estimator $f$, the second 
term is an intrinsic limitation that comes from the fact that the independent variable 
$x$ does not determine uniquely the response $y$. 
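For readers who do not want to turn to the appendix, the decomposition (2.2) can be sketched in two steps. This is the standard argument and is offered here only as a reminder; it assumes all the expectations involved are finite.

```latex
\begin{align*}
I[f] &= E\big[(y - f_0(x) + f_0(x) - f(x))^2\big] \\
     &= E\big[(y - f_0(x))^2\big] + E\big[(f_0(x) - f(x))^2\big]
        + 2\,E\big[(y - f_0(x))\,(f_0(x) - f(x))\big] .
\end{align*}
% The cross term vanishes: condition on x and use f_0(x) = E[y | x].
\begin{align*}
E\big[(y - f_0(x))\,(f_0(x) - f(x))\big]
  = E_x\Big[(f_0(x) - f(x))\,\underbrace{E\big[\,y - f_0(x) \mid x\,\big]}_{=\,0}\Big] = 0 .
\end{align*}
```

Only the term $E[(f_0(x) - f(x))^2]$ depends on $f$, and it is zero exactly when $f = f_0$ almost everywhere with respect to $P(x)$.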

The problem of learning from examples can now be reformulated as the problem 
of reconstructing the regression function $f_0$, given the example set $D_l$. Thus we have 
some large class of functions $\mathcal{F}$ to which the target function $f_0$ belongs. We obtain 
noisy data of the form $(x, y)$ where $x$ has the distribution $P(x)$ and, for each $x$, $y$ is 
a random variable with mean $f_0(x)$ and distribution $P(y|x)$. We note that $y$ can be 
viewed as a deterministic function of $x$ corrupted by noise. If one assumes the noise 
is additive, we can write $y = f_0(x) + \eta_x$, where $\eta_x$ (see footnote 6) is zero-mean 
noise whose distribution is determined by $P(y|x)$. We choose an estimator on the basis of 
the data set and we hope that it is close to the regression (target) function. It should 
also be pointed out that this framework includes pattern classification, and in this case 
the regression (target) function corresponds to the Bayes discriminant function (Gish, 
1990; Hampshire and Pearlmutter, 1990; Richard and Lippman, 1991). 

2.2.4 The Empirical Risk 

If the expected risk functional $I[f]$ were known, one could compute the regression 
function by simply finding its minimum in $\mathcal{F}$, which would make the whole learning 
problem considerably easier. What makes the problem difficult and interesting is 
that in practice $I[f]$ is unknown because $P(x, y)$ is unknown. Our only source of 
information is the data set $D_l$, which consists of $l$ independent random samples of 
$X \times Y$ drawn according to $P(x, y)$. Using this data set, the expected risk can be 
approximated by the empirical risk $I_{emp}$: 

$I_{emp}[f] = \frac{1}{l} \sum_{i=1}^{l} (y_i - f(x_i))^2$ . 

For each given estimator $f$, the empirical risk is a random variable, and under fairly 
general assumptions (see footnote 7), by the law of large numbers (Dudley, 1989) it 
converges in probability to the expected risk as the number of data points goes to infinity: 

$\lim_{l \to \infty} P\{|I[f] - I_{emp}[f]| > \varepsilon\} = 0 \quad \forall \varepsilon > 0$ .    (2.4) 

6 Note that the standard regression problem often assumes $\eta_x$ is independent of $x$. Our case is 
distribution free because we make no assumptions about the nature of $\eta_x$. 

Therefore a common strategy consists in estimating the regression function as the 
function that minimizes the empirical risk, since it is "close" to the expected risk if 
the number of data points is large enough. For the error metric we have used, this yields 
the least-squares error estimator. However, eq. (2.4) states only that the expected 
risk is "close" to the empirical risk for each given $f$, and not for all $f$ simultaneously. 
Consequently the fact that the empirical risk converges in probability to the expected 
risk when the number, $l$, of data points goes to infinity does not guarantee that the 
minimum of the empirical risk will converge to the minimum of the expected risk 
(the regression function). As pointed out and analyzed in the fundamental work of 
Vapnik and Chervonenkis, the notion of uniform convergence in probability has to be 
introduced, and it will be discussed in other parts of this chapter. 
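The pointwise convergence in eq. (2.4) is easy to observe numerically for a single fixed estimator. The sketch below assumes a toy distribution (uniform $x$, a linear regression function, bounded uniform noise) for which the expected risk of $f = f_0$ is known in closed form; none of these choices come from the original text.

```python
import random

def empirical_risk(f, data):
    """I_emp[f] = (1/l) * sum_i (y_i - f(x_i))^2."""
    return sum((y - f(x)) ** 2 for x, y in data) / len(data)

def sample(l, rng):
    # Assumed toy distribution: x ~ U[0, 1], y = 2x + noise, noise ~ U[-0.5, 0.5].
    return [(x, 2.0 * x + rng.uniform(-0.5, 0.5))
            for x in (rng.random() for _ in range(l))]

f = lambda x: 2.0 * x        # one fixed estimator (here equal to f_0)
expected_risk = 1.0 / 12.0   # I[f_0] = variance of U[-0.5, 0.5]

rng = random.Random(0)
for l in (10, 100, 10_000):
    gap = abs(empirical_risk(f, sample(l, rng)) - expected_risk)
    print(l, gap)            # the gap tends to shrink as l grows
```

Note that this illustrates convergence for one fixed $f$ only; it says nothing about the uniform convergence over a whole class of functions that the text goes on to require.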

2.2.5 The Problem 

The argument of the previous section suggests that an approximate solution of the 
learning problem consists in finding the minimum of the empirical risk, that is, solving 

$\min_{f \in \mathcal{F}} I_{emp}[f]$ . 

However this problem is clearly ill-posed, because, for most choices of $\mathcal{F}$, it will have 
an infinite number of solutions. In fact, all the functions in $\mathcal{F}$ that interpolate the 
data points $(x_i, y_i)$, that is, with the property 

$f(x_i) = y_i , \quad i = 1, \ldots, l$ 

will give a zero value for $I_{emp}$. This problem is very common in approximation theory 
and statistics and can be approached in several ways. A common technique consists 
in restricting the search for the minimum to a smaller set than $\mathcal{F}$. We consider the 
case in which this smaller set is a family of parametric functions, that is, a family of 
functions defined by a certain number of real parameters. The choice of a parametric 
representation also provides a convenient way to store and manipulate the hypothesis 
function on a computer. 
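The ill-posedness is easy to exhibit concretely. In the sketch below (an illustration only, with a made-up three-point data set), two visibly different functions both interpolate the data and therefore both attain zero empirical risk: a Lagrange interpolating polynomial, and the same polynomial plus a term that vanishes at every data point.

```python
def lagrange_interpolant(points):
    """Polynomial of degree len(points) - 1 passing through all the points."""
    def p(x):
        total = 0.0
        for i, (xi, yi) in enumerate(points):
            term = yi
            for j, (xj, _) in enumerate(points):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return p

def vanishing_term(points, scale):
    """scale * prod_i (x - x_i): zero at every data point, nonzero elsewhere."""
    def b(x):
        out = scale
        for xi, _ in points:
            out *= (x - xi)
        return out
    return b

def empirical_risk(f, data):
    return sum((y - f(x)) ** 2 for x, y in data) / len(data)

data = [(0.0, 1.0), (1.0, 3.0), (2.0, 2.0)]   # hypothetical examples
f1 = lagrange_interpolant(data)
bump = vanishing_term(data, scale=5.0)
f2 = lambda x: f1(x) + bump(x)                # differs from f1 off the data points
```

Varying `scale` gives a one-parameter family of minimizers, which is one way to see that the unrestricted problem has infinitely many solutions.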

We will denote a generic subset of $\mathcal{F}$ whose elements are parametrized by a number 
of parameters proportional to $n$, by $H_n$. Moreover, we will assume that the sets $H_n$ 
form a nested family, that is 

$H_1 \subseteq H_2 \subseteq \ldots \subseteq H_n \subseteq \ldots \subseteq H$ . 

7 For example, assuming the data is independently drawn and $I[f]$ is finite. 

For example, $H_n$ could be the set of polynomials in one variable of degree $n - 1$, Radial 
Basis Functions with $n$ centers, multilayer perceptrons with $n$ sigmoidal hidden units, 
multilayer perceptrons with $n$ threshold units and so on. Therefore, we choose as 
approximation to the regression function the function $f_{n,l}$ defined as (see footnote 8): 

$f_{n,l} = \arg\min_{f \in H_n} I_{emp}[f]$ .    (2.5) 

Thus, for example, if $H_n$ is the class of functions which can be represented as 
$f = \sum_{\alpha=1}^{n} c_\alpha H(x; w_\alpha)$ then eq. (2.5) can be written as 

$f_{n,l} = \arg\min_{c_\alpha, w_\alpha} I_{emp}[f]$ . 
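As a small worked instance of eq. (2.5), take the polynomial family mentioned above: $H_1$ is the set of constants (degree 0) and $H_2$ the set of affine functions (degree 1). For these two classes the empirical risk minimizer has a closed form, so the sketch below needs no optimizer. The data set is made up for the example.

```python
def empirical_risk(f, data):
    return sum((y - f(x)) ** 2 for x, y in data) / len(data)

def erm_constant(data):
    """Minimizer of I_emp over H_1 = {f(x) = c}: the sample mean of the y_i."""
    c = sum(y for _, y in data) / len(data)
    return lambda x: c

def erm_affine(data):
    """Minimizer of I_emp over H_2 = {f(x) = a*x + b}: ordinary least squares."""
    l = len(data)
    mx = sum(x for x, _ in data) / l
    my = sum(y for _, y in data) / l
    sxx = sum((x - mx) ** 2 for x, _ in data)
    sxy = sum((x - mx) * (y - my) for x, y in data)
    a = sxy / sxx            # assumes the x_i are not all equal
    b = my - a * mx
    return lambda x: a * x + b

data = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.2), (3.0, 2.8)]   # hypothetical examples
risk_1 = empirical_risk(erm_constant(data), data)
risk_2 = empirical_risk(erm_affine(data), data)
```

Because $H_1 \subseteq H_2$, the minimum empirical risk over $H_2$ can never exceed the minimum over $H_1$; this monotonicity in $n$ holds for any nested family.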

A number of observations need to be made here. First, if the class $\mathcal{F}$ is small (typically 
in the sense of bounded VC-dimension or bounded metric entropy (Pollard, 1984)), 
then the problem is not necessarily ill-posed and we do not have to go through the 
process of using the sets $H_n$. However, as has been mentioned already, for most 
interesting choices of $\mathcal{F}$ (e.g. classes of functions in Sobolev spaces, continuous functions 
etc.) the problem might be ill-posed. This might not be the only reason 
for using the classes $H_n$. It might be that this is all we have, or that for some 
reason it is what we would like to use. For example, one might want to use a 
particular class of feed-forward networks because of ease of implementation in VLSI. 
Also, if we were to solve the function learning problem on a computer, as is typically 
done in practice, then the functions in $\mathcal{F}$ have to be represented somehow. We might 
consequently use $H_n$ as a representation scheme. It should be pointed out that the 
sets $H_n$ and $\mathcal{F}$ have to be matched with each other. For example, we would hardly 
use polynomials as an approximation scheme when the class $\mathcal{F}$ consists of indicator 
functions or, for that matter, use threshold units when the class $\mathcal{F}$ contains continuous 
functions. In particular, if we are to recover the regression function, $H$ must be dense 
in $\mathcal{F}$. One could look at this matching from both directions. For a class $\mathcal{F}$, one might 
be interested in an appropriate choice of $H_n$. Conversely, for a particular choice of 
$H_n$, one might ask what classes $\mathcal{F}$ can be effectively solved with this scheme. Thus, 
if we were to use multilayer perceptrons, this line of questioning would lead us to 
identify the class of problems which can be effectively solved by them. 

8 Notice that we are implicitly assuming that the problem of minimizing $I_{emp}[f]$ over $H_n$ has a 
solution, which might not be the case. However the quantity 

$E_{n,l} = \inf_{f \in H_n} I_{emp}[f]$ 

is always well defined, and we can always find a function $f_{n,l}$ for which $I_{emp}[f_{n,l}]$ is arbitrarily close 
to $E_{n,l}$. It will turn out that this is sufficient for our purposes, and therefore we will continue, 
assuming that $f_{n,l}$ is well defined by eq. (2.5). 

Thus, we see that in principle we would like to minimize $I[f]$ over the large 
class $\mathcal{F}$, obtaining thereby the regression function $f_0$. What we do in practice is to 
minimize the empirical risk $I_{emp}[f]$ over the smaller class $H_n$, obtaining the function 
$f_{n,l}$. Assuming we have solved all the computational problems related to the actual 
computation of the estimator $f_{n,l}$, the main problem is now: 

how good is $f_{n,l}$? 

Independently of the measure of performance that we choose when answering this 
question, we expect $f_{n,l}$ to become a better and better estimator as $n$ and $l$ go to 
infinity. In fact, when $l$ increases, our estimate of the expected risk improves and our 
estimator improves. The case of $n$ is trickier. As $n$ increases, we have more parameters 
to model the regression function, and our estimator should improve. However, at the 
same time, because we have more parameters to estimate with the same amount of 
data, our estimate of the expected risk deteriorates. Thus we now need more data, and 
$n$ and $l$ have to grow as a function of each other for convergence to occur. At what 
rate and under what conditions the estimator $f_{n,l}$ improves depends on the properties 
of the regression function, that is on $\mathcal{F}$, and on the approximation scheme we are 
using, that is on $H_n$. 

2.2.6 Bounding the Generalization Error 

At this stage it might be worthwhile to review and remark on some general features of 
the problem of learning from examples. Let us remember that our goal is to minimize 
the expected risk $I[f]$ over the set $\mathcal{F}$. If we were to use a finite number of parameters, 
then we have already seen that the best we could possibly do is to minimize our 
functional over the set $H_n$, yielding the estimator $f_n$: 

$f_n = \arg\min_{f \in H_n} I[f]$ . 

However, not only is the parametrization limited, but the data is also finite, and we 
can only minimize the empirical risk $I_{emp}$, obtaining as our final estimate the function 
$f_{n,l}$. Our goal is to bound the distance between $f_{n,l}$, which is our solution, and $f_0$, 
which is the "optimal" solution. If we choose to measure the distance in the $L^2(P)$ metric 
(see appendix 2-A), the quantity that we need to bound, which we will call the generalization 
error, is: 

$E[(f_0 - f_{n,l})^2] = \int_X dx\, P(x)\,(f_0(x) - f_{n,l}(x))^2 = \|f_0 - f_{n,l}\|^2_{L^2(P)}$ . 
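When $P(x)$ can be sampled, the $L^2(P)$ distance above can be estimated by simple Monte Carlo averaging. The following is a minimal sketch under assumed ingredients: the functions `f0` and `f_hat` and the uniform input distribution are placeholders picked so that the exact answer ($1/30$) is known.

```python
import random

def l2p_distance_sq(f0, f, sample_x, m=100_000, seed=0):
    """Monte Carlo estimate of E_x[(f0(x) - f(x))^2] with x ~ P(x)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(m):
        x = sample_x(rng)
        total += (f0(x) - f(x)) ** 2
    return total / m

f0 = lambda x: x * x                   # hypothetical regression function
f_hat = lambda x: x                    # hypothetical estimator
uniform01 = lambda rng: rng.random()   # assumed P(x): uniform on [0, 1]

estimate = l2p_distance_sq(f0, f_hat, uniform01)
# Exact value for these choices: integral of (x^2 - x)^2 over [0, 1] = 1/30.
```

In the learning setting itself $f_0$ and $P(x)$ are unknown, so this quantity cannot be computed from data; the sketch is only meant to make the definition concrete.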

There are two main factors that contribute to the generalization error, and we are going 
to analyze them separately for the moment. 

1. A first cause of error comes from the fact that we are trying to approximate an 
infinite dimensional object, the regression function $f_0 \in \mathcal{F}$, with a finite number 
of parameters. We call this error the approximation error, and we measure it by 
the quantity $E[(f_0 - f_n)^2]$, that is, the $L^2(P)$ distance between the best function 
in $H_n$ and the regression function. The approximation error can be expressed 
in terms of the expected risk using the decomposition (2.2) as 

$E[(f_0 - f_n)^2] = I[f_n] - I[f_0]$ .    (2.6) 

Notice that the approximation error does not depend on the data set $D_l$, but 
depends only on the approximating power of the class $H_n$. The natural framework 
to study it is approximation theory, which abounds with bounds on the 
approximation error for a variety of choices of $H_n$ and $\mathcal{F}$. In the following we will 
always assume that it is possible to bound the approximation error as follows: 

$E[(f_0 - f_n)^2] \le \varepsilon(n)$ 

where $\varepsilon(n)$ is a function that goes to zero as $n$ goes to infinity if $H$ is dense in 
$\mathcal{F}$. In other words, as shown in figure (2-6), as the number $n$ of parameters gets 
larger the representation capacity of $H_n$ increases, and allows a better and better 
approximation of the regression function $f_0$. This issue has been studied by a 
number of researchers (Cybenko, 1989; Hornik, Stinchcombe, and White, 1989; 
Barron, 1991, 1993; Funahashi, 1989; Mhaskar and Micchelli, 1992; Mhaskar, 
1993) in the neural networks community. 

2. Another source of error comes from the fact that, due to finite data, we minimize 
the empirical risk $I_{emp}[f]$, and obtain $f_{n,l}$, rather than minimizing the expected 
risk $I[f]$, and obtaining $f_n$. As the number of data points goes to infinity we hope that 
$f_{n,l}$ will converge to $f_n$, and convergence will take place if the empirical risk 
converges to the expected risk uniformly in probability (Vapnik, 1982). The 
quantity 

$|I_{emp}[f] - I[f]|$ 

is called the estimation error, and conditions for the estimation error to converge 
to zero uniformly in probability have been investigated by Vapnik and 
Chervonenkis, Pollard, Dudley (1987), and Haussler (1989). Under a variety of 
different hypotheses it is possible to prove that, with probability $1 - \delta$, a bound 
of this form is valid: 

$|I_{emp}[f] - I[f]| \le \omega(l, n, \delta) \quad \forall f \in H_n$    (2.7) 

The specific form of $\omega$ depends on the setting of the problem, but, in general, we 
expect $\omega(l, n, \delta)$ to be a decreasing function of $l$. However, we also expect it to 
be an increasing function of $n$. The reason is that, if the number of parameters 
is large then the expected risk is a very complex object, and then more data 
will be needed to estimate it. Therefore, keeping fixed the number of data points and 
increasing the number of parameters will result, on the average, in a larger 
distance between the expected risk and the empirical risk. 

The approximation and estimation errors are clearly two components of the 
generalization error, and it is interesting to notice that, as shown in the next statement, the 
generalization error can be bounded by the sum of the two: 

Statement 2.2.1 The following inequality holds: 

$\|f_0 - f_{n,l}\|^2_{L^2(P)} \le \varepsilon(n) + 2\omega(l, n, \delta)$ .    (2.8) 

Proof: using the decomposition of the expected risk (2.2), the generalization error 
can be written as: 

$\|f_0 - f_{n,l}\|^2_{L^2(P)} = E[(f_0 - f_{n,l})^2] = I[f_{n,l}] - I[f_0]$ .    (2.9) 

A natural way of bounding the generalization error is as follows: 

$E[(f_0 - f_{n,l})^2] \le |I[f_n] - I[f_0]| + |I[f_n] - I[f_{n,l}]|$ .    (2.10) 

In the first term of the right hand side of the previous inequality we recognize the 
approximation error (2.6). If a bound of the form (2.7) is known for the estimation 
error, it is simple to show (see appendix (2-C)) that the second term can be bounded 
as 

$|I[f_n] - I[f_{n,l}]| \le 2\omega(l, n, \delta)$ 

and statement (2.2.1) follows. □ 
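The bound on the second term can be sketched as follows. This is the standard argument for empirical risk minimizers and we assume it matches the one in appendix (2-C); it uses only the uniform bound (2.7) and the definitions of $f_n$ and $f_{n,l}$ as minimizers over $H_n$ of $I$ and $I_{emp}$ respectively.

```latex
\begin{align*}
I[f_{n,l}] &\le I_{emp}[f_{n,l}] + \omega(l,n,\delta)
            && \text{by (2.7) applied to } f_{n,l} \in H_n \\
           &\le I_{emp}[f_n] + \omega(l,n,\delta)
            && \text{since } f_{n,l} \text{ minimizes } I_{emp} \text{ over } H_n \\
           &\le I[f_n] + 2\,\omega(l,n,\delta)
            && \text{by (2.7) applied to } f_n \in H_n .
\end{align*}
% Since f_n minimizes I over H_n, we also have I[f_n] \le I[f_{n,l}], hence
\begin{align*}
0 \;\le\; I[f_{n,l}] - I[f_n] \;\le\; 2\,\omega(l,n,\delta) .
\end{align*}
```

Both applications of (2.7) hold simultaneously on the same event of probability at least $1 - \delta$, because (2.7) is uniform over $H_n$.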

Thus we see that the generalization error has two components: one, bounded 
by $\varepsilon(n)$, is related to the approximation power of the class of functions $\{H_n\}$, and is 
studied in the framework of approximation theory. The second, bounded by $\omega(l, n, \delta)$, 
is related to the difficulty of estimating the parameters given finite data, and is studied 
in the framework of statistics. Consequently, results from both these fields are needed 
in order to provide an understanding of the problem of learning from examples. Figure 
(2-6) also shows a picture of the problem. 




Figure 2-6: This figure shows a picture of the problem. The outermost circle 
represents the set $\mathcal{F}$. Embedded in this are the nested subsets, the $H_n$'s. $f_0$ is an arbitrary 
target function in $\mathcal{F}$, $f_n$ is the closest element of $H_n$ and $f_{n,l}$ is the element of $H_n$ 
which the learner hypothesizes on the basis of data. 

2.2.7 A Note on Models and Model Complexity 

From the form of eq. (2.8) the reader will quickly realize that there is a trade-off 
between $n$ and $l$ for a certain generalization error. For a fixed $l$, as $n$ increases, the 
approximation error $\varepsilon(n)$ decreases but the estimation error $\omega(l, n, \delta)$ increases. 
Consequently, there is a certain $n$ which might optimally balance this trade-off. Note that 
the classes $H_n$ can be looked upon as models of increasing complexity and the search 
for an optimal $n$ amounts to a search for the right model complexity. One typically 
wishes to match the model complexity with the sample complexity (measured by how 
much data we have on hand) and this problem is well studied (Eubank, 1988; Stone, 
1974; Linehart and Zucchini, 1986; Rissanen, 1989; Barron and Cover, 1989; Efron, 
1982; Craven and Wahba, 1979) in statistics. 

Broadly speaking, simple models would have high approximation errors but small 
estimation errors while complex models would have low approximation errors but high 
estimation errors. This might be true even when considering qualitatively different 
models, and as an illustrative example let us consider two kinds of models we might use 
to learn regression functions in the space of bounded continuous functions. The class 
of linear models, i.e., the class of functions which can be expressed as $f = w \cdot x + \theta$, does 
not have much approximating power and consequently its approximation error is 
rather high. However, its estimation error is quite low. The class of models which 
can be expressed in the form $H = \sum_{i=1}^{n} c_i \sin(w_i \cdot x + \theta_i)$ has higher approximating 
power (Jones, 1990), resulting in low approximation errors. However this class has an 
infinite VC-dimension and its estimation error cannot therefore be bounded. 

So far we have provided a very general characterization of this problem, without 
stating what the sets $\mathcal{F}$ and $H_n$ are. As we have already mentioned before, the set 
$\mathcal{F}$ could be a set of bounded differentiable or integrable functions, and $H_n$ could be 
polynomials of degree $n$, spline functions with $n$ knots, multilayer perceptrons with 
$n$ hidden units or any other parametric approximation scheme with $n$ parameters. In 
the next section we will consider a specific choice for these sets, and we will provide 
a bound on the generalization error of the form of eq. (2.8). 

2.3 Stating the Problem for Radial Basis Functions 

As mentioned before, the problem of learning from examples reduces to estimating 
some target function from a set $X$ to a set $Y$. In most practical cases, such as character 
recognition, motor control, time series prediction, the set $X$ is the $k$-dimensional 
Euclidean space $R^k$, and the set $Y$ is some subset of the real line, which for our purposes 
we will assume to be the interval $[-M, M]$, where $M$ is some positive number. In 
fact, there is a probability distribution $P(x, y)$ defined on the space $R^k \times [-M, M]$ 
according to which the labelled examples are drawn independently at random, and 
from which we try to estimate the regression (target) function. It is clear that the 
regression function is a real function of $k$ variables. 

In this chapter we focus our attention on the Radial Basis Functions approximation 
scheme (also called Hyper-Basis Functions; Poggio and Girosi, 1990). This is the 
class of approximating functions that can be written as: 

$f(x) = \sum_{i=1}^{n} \beta_i\, G\!\left(\frac{\|x - t_i\|}{\sigma_i}\right)$    (2.11) 

where $G$ is some given basis function (in our case Gaussian, specifically $G(a) = 
V e^{-a^2}$) and the $\beta_i$, $t_i$, and $\sigma_i$ are free parameters. We would like to understand what 
classes of problems can be solved "well" by this technique, where "well" means that 
both approximation and estimation bounds need to be favorable. It is possible to 
show that a favorable approximation bound can be obtained if we assume that the 
class of functions $\mathcal{F}$ to which the regression function belongs is defined as follows: 

$\mathcal{F} = \{ f \mid f = \lambda * G_m ,\; m > k/2 ,\; |\lambda|_{R^k} \le M \}$ .    (2.12) 

Here $\lambda$ is a signed Radon measure on the Borel sets of $R^k$, and $G_m$ is the Bessel-Macdonald 
kernel, i.e., the inverse Fourier transform of 

$\tilde{G}_m(s) = \frac{1}{(1 + 4\pi^2 \|s\|^2)^{m/2}}$ . 

The symbol $*$ stands for the convolution operation, $|\lambda|_{R^k}$ is the total variation (see 
footnote 9) of the measure $\lambda$, and $M$ is a positive real number. The space $\mathcal{F}$ as defined in 
eq. (2.12) is the so-called Liouville Space of order $m$. If $m$ is even, this contains the Sobolev 
Space $H^{m,1}$ of functions whose derivatives up to order $m$ are integrable. 

We point out that the class $\mathcal{F}$ is non-trivial to learn in the sense that it has infinite 
pseudo-dimension (Pollard, 1984). 

In order to obtain an estimation bound we need the approximating class to have 
bounded variation, and the following constraint will be imposed: 

$\sum_{i=1}^{n} |\beta_i| \le M$ . 

9 A signed measure $\lambda$ can be decomposed by the Hahn-Jordan decomposition into $\lambda = \lambda^+ - \lambda^-$. 
Then $|\lambda| = \lambda^+ + \lambda^-$ is called the total variation of $\lambda$. See Dudley (1989) for more information. 

This constraint does not affect the approximation bound, and the two pieces fit 
together nicely. Thus the set $H_n$ is defined now as the set of functions belonging to $L^2$ 
such that 

$f(x) = \sum_{i=1}^{n} \beta_i\, G\!\left(\frac{\|x - t_i\|}{\sigma_i}\right) , \quad \sum_{i=1}^{n} |\beta_i| \le M , \quad t_i \in R^k , \; \sigma_i \in R$ .    (2.13) 
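A member of $H_n$ is straightforward to evaluate on a computer. The sketch below is a minimal illustration: it assumes the Gaussian basis function $G(a) = e^{-a^2}$ with the constant $V$ taken to be 1, and the helper names `rbf_network` and `in_H_n` are hypothetical.

```python
import math

def gaussian(a):
    # Assumed basis function G(a) = exp(-a^2); the constant V is taken as 1.
    return math.exp(-a * a)

def rbf_network(betas, centers, sigmas):
    """Return f(x) = sum_i beta_i * G(||x - t_i|| / sigma_i) for x in R^k."""
    def f(x):
        total = 0.0
        for beta, t, sigma in zip(betas, centers, sigmas):
            dist = math.sqrt(sum((xj - tj) ** 2 for xj, tj in zip(x, t)))
            total += beta * gaussian(dist / sigma)
        return total
    return f

def in_H_n(betas, M):
    """Check the constraint sum_i |beta_i| <= M."""
    return sum(abs(b) for b in betas) <= M

# A hypothetical network with n = 2 nodes in k = 2 dimensions.
betas = [0.5, -0.25]
centers = [(0.0, 0.0), (1.0, 1.0)]
sigmas = [1.0, 2.0]
f = rbf_network(betas, centers, sigmas)
```

With this choice of $G$, which never exceeds 1, the constraint on the coefficients also bounds the output: $|f(x)| \le \sum_i |\beta_i| \le M$ for every $x$.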

Having defined the sets $H_n$ and $\mathcal{F}$ we remind the reader that our goal is to recover 
the regression function, that is, the minimum of the expected risk over $\mathcal{F}$. What we 
end up doing is to draw a set of $l$ examples and to minimize the empirical risk $I_{emp}$ 
over the set $H_n$, that is, to solve the following non-convex minimization problem: 

$f_{n,l} = \arg\min_{\beta_\alpha, t_\alpha, \sigma_\alpha} \sum_{i=1}^{l} \left( y_i - \sum_{\alpha=1}^{n} \beta_\alpha\, G\!\left(\frac{\|x_i - t_\alpha\|}{\sigma_\alpha}\right) \right)^2$    (2.14) 

Notice that the assumption that the regression function 

$f_0(x) = E[y|x]$ 

belongs to the class $\mathcal{F}$ correspondingly implies an assumption on the probability 
distribution $P(y|x)$, viz., that $P$ must be such that $E[y|x]$ belongs to $\mathcal{F}$. Notice also 
that since we assumed that $Y$ is a closed interval, we are implicitly assuming that 
$P(y|x)$ has compact support. 

Assuming now that we have been able to solve the minimization problem of eq. 
(2.14), the main question we are interested in is "how far is $f_{n,l}$ from $f_0$?". We give 
an answer in the next section. 

2.4 Main Result 

The main theorem is: 

Theorem 2.4.1 For any $0 < \delta < 1$, for $n$ nodes, $l$ data points, input dimensionality 
of $k$, and $H_n$, $\mathcal{F}$, $f_0$, $f_{n,l}$ also as defined in the statement of the problem above, with 
probability greater than $1 - \delta$, 

$\|f_0 - f_{n,l}\|^2_{L^2(P)} \le O\!\left(\frac{1}{n}\right) + O\!\left(\left[\frac{n k \ln(n l) - \ln \delta}{l}\right]^{1/2}\right)$ . 

Proof: The proof requires us to go through a series of propositions and lemmas which 
have been relegated to appendix (2-D) for continuity of ideas. □ 

2.5 Remarks 

There are a number of comments we would like to make on the formulation of our 
problem and the result we have obtained. There is a vast body of literature on 
approximation theory and the theory of empirical risk minimization. In recent times, 
some of the results in these areas have been applied by the computer science and 
neural network community to study formal learning models. Here we would like to 
make certain observations about our result, suggest extensions and future work, and 
to make connections with other work done in related areas. 

2.5.1 Observations on the Main Result 

• The theorem has a PAC-like (Valiant, 1984) setting. It tells us that if we 
draw enough data points (labelled examples) and have enough nodes in our 
Radial Basis Functions network, we can drive our error arbitrarily close to 
zero with arbitrarily high probability. Note however that our result is not 
entirely distribution-free. Although no assumptions are made on the form of 
the underlying distribution, we do have certain constraints on the kinds of 
distributions for which this result holds. In particular, the distribution is such 
that its conditional mean $E[y|x]$ (this is also the regression function $f_0(x)$) 
must belong to the class of functions $\mathcal{F}$ defined by eq. (2.12). Further, the 
distribution $P(y|x)$ must have compact support (see footnote 10). 

• The error bound consists of two parts, one ($O(1/n)$) coming from 
approximation theory, and the other ($O(((n k \ln(n l) + \ln(1/\delta))/l)^{1/2})$) from statistics. It is 
noteworthy that for a given approximation scheme (corresponding to $\{H_n\}$), a 
certain class of functions (corresponding to $\mathcal{F}$) suggests itself. So we have gone 
from the class of networks to the class of problems they can perform as opposed 
to the other way around, i.e., from a class of problems to an optimal class of 
networks. 



10 This condition, which is related to the problem of large deviations, could be relaxed, and will be 
the subject of further investigations. 

• This sort of result implies that if we have the prior knowledge that $f_0$ belongs to 
class $\mathcal{F}$, then by choosing the number of data points, $l$, and the number of basis 
functions, $n$, appropriately, we can drive the misclassification error arbitrarily 
close to Bayes rate. In fact, for a fixed amount of data, even before we have 
started looking at the data, we can pick a starting architecture, i.e., the number 
of nodes, $n$, for optimal performance. After looking at the data, we might be 
able to do some structural risk minimization (Vapnik, 1982) to further improve 
architecture selection. For a fixed architecture, this result sheds light on how 
much data is required for a certain error performance. Moreover, it allows us 
to choose the number of data points and number of nodes simultaneously for 
guaranteed error performances. Section 2.6 explores this question in greater 
detail. 
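To make the idea of picking an architecture before seeing the data concrete, one can minimize the right-hand side of the theorem over $n$ for a fixed $l$. The sketch below is purely illustrative: the theorem only gives orders of magnitude, so both hidden constants are arbitrarily set to 1 here, and the resulting numbers should not be read as actual recommendations.

```python
import math

def bound(n, l, k, delta):
    """1/n + sqrt((n*k*ln(n*l) + ln(1/delta)) / l), with both O(.) constants set to 1."""
    approximation = 1.0 / n
    estimation = math.sqrt((n * k * math.log(n * l) + math.log(1.0 / delta)) / l)
    return approximation + estimation

def best_n(l, k, delta, n_max=1000):
    """Number of nodes in 1..n_max that minimizes the bound for a fixed l."""
    return min(range(1, n_max + 1), key=lambda n: bound(n, l, k, delta))

for l in (10**3, 10**5, 10**7):
    n_star = best_n(l, k=2, delta=0.05)
    print(l, n_star, bound(n_star, l, 2, 0.05))
```

Under these assumed constants the minimizing $n$ grows with $l$, which mirrors the qualitative claim that richer networks become worthwhile only as more data becomes available.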

2.5.2 Extensions 

• There are certain natural extensions to this work. We have essentially proved 
the consistency of the estimated network function $f_{n,l}$. In particular we have 
shown that $f_{n,l}$ converges to $f_0$ with probability 1 as $l$ and $n$ grow to infinity. 
It is also possible to derive conditions for almost sure convergence. Further, 
we have looked at a specific class of networks ($\{H_n\}$) which consist of weighted 
sums of Gaussian basis functions with moving centers but fixed variance. This 
kind of approximation scheme suggests a class of functions $\mathcal{F}$ which can be 
approximated with guaranteed rates of convergence, as mentioned earlier. We 
could prove similar theorems for other kinds of basis functions which would have 
stronger approximation properties than the class of functions considered here. 
The general principle on which the proof is based can hopefully be extended to 
a variety of approximation schemes. 

• We have used notions of metric entropy and covering number (Dudley, 1987; 
Pollard, 1984) in obtaining our uniform convergence results. Haussler (1989) 
uses the results of Pollard and Dudley to obtain uniform convergence results and 
our techniques closely follow his approach. It should be noted here that Vapnik 
deals with exactly the same question and uses the VC-dimension instead. It 
would be interesting to compute the VC-dimension of the class of networks and 
use it to obtain our results. 

• While we have obtained an upper bound on the error in terms of the number 
of nodes and examples, it would be worthwhile to obtain lower bounds on the 
same. Such lower bounds do not seem to exist in the neural network literature 
to the best of our knowledge. 

• We have considered here a situation where the estimated network, i.e., $f_{n,l}$, is 
obtained by minimizing the empirical risk over the class of functions $H_n$. Very 
often, the estimated network is obtained by minimizing a somewhat different 
objective function which consists of two parts. One is the fit to the data and 
the other is some complexity term which favors less complex (according to the 
defined notion of complexity) functions over more complex ones. For example 
the regularization approach (Tikhonov, 1963; Poggio and Girosi, 1992; Wahba, 
1990) minimizes a cost function of the form 

$H[f] = \sum_{i=1}^{N} (y_i - f(x_i))^2 + \lambda \Phi[f]$ 

over the class $H = \cup_{n \ge 1} H_n$. Here $\lambda$ is the so-called "regularization parameter" 
and $\Phi[f]$ is a functional which measures smoothness of the functions involved. 
It would be interesting to obtain convergence conditions and rates for such 
schemes. Choice of an optimal $\lambda$ is an interesting question in regularization 
techniques and typically cross-validation or other heuristic schemes are used. A 
result on convergence rate potentially offers a principled way to choose $\lambda$. 

• Structural risk minimization is another method to achieve a trade-off between 
network complexity (corresponding to $n$ in our case) and fit to data. However it 
does not guarantee that the architecture selected will be the one with minimal 
parameterization (see footnote 11). In fact, it would be of some interest to develop a 
sequential growing scheme. Such a technique would at any stage perform a sequential 
hypothesis test (Govindarajulu, 1975). It would then decide whether to ask 
for more data, add one more node or simply stop and output the function it 
has as its $\varepsilon$-good hypothesis. In such a process, one might even incorporate 
active learning (Angluin, 1988) so that if the algorithm asks for more data, 
then it might even specify a region in the input domain from where it would 
like to see this data. It is conceivable that such a scheme would grow to minimal 
parameterization (or closer to it at any rate) and require less data than classical 
structural risk minimization. 



11 Neither does regularization for that matter. The question of minimal parameterization is related 
to that of order determination of systems, a very difficult problem! 

• It should be noted here that we have assumed that the empirical risk 
$\sum_{i=1}^{l} (y_i - f(x_i))^2$ can be minimized over the class $H_n$ and the function $f_{n,l}$ be effectively 
computed. While this might be fine in principle, in practice only a locally 
optimal solution to the minimization problem is found (typically using some 
gradient descent schemes). The computational complexity of obtaining even 
an approximate solution to the minimization problem is an interesting one and 
results from computer science (Judd, 1988; Blum and Rivest, 1988) suggest that 
it might in general be NP-hard. 
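The last point can be illustrated with a minimal gradient descent sketch for a one-dimensional network. Everything here is an assumption made for the example: the basis function $G(a) = e^{-a^2}$, the step-halving rule, the starting point and the synthetic data. The constraint $\sum_i |\beta_i| \le M$ is not enforced, and the procedure only finds a local solution, as the text warns.

```python
import math

def predict(params, x):
    """1-D RBF network; params is a list of (beta, t, sigma) triples."""
    return sum(b * math.exp(-((x - t) / s) ** 2) for b, t, s in params)

def risk(params, data):
    return sum((y - predict(params, x)) ** 2 for x, y in data) / len(data)

def gradient(params, data):
    """Analytic gradient of the empirical risk w.r.t. each (beta, t, sigma)."""
    grads = [[0.0, 0.0, 0.0] for _ in params]
    for x, y in data:
        err = -2.0 * (y - predict(params, x)) / len(data)
        for g, (b, t, s) in zip(grads, params):
            u = (x - t) / s
            e = math.exp(-u * u)
            g[0] += err * e                        # d f / d beta
            g[1] += err * b * e * 2.0 * u / s      # d f / d t
            g[2] += err * b * e * 2.0 * u * u / s  # d f / d sigma
    return grads

def descend(params, data, steps=200, lr=0.5):
    """Gradient descent with step halving; only risk-decreasing steps are kept."""
    current = risk(params, data)
    for _ in range(steps):
        grads = gradient(params, data)
        trial = [(b - lr * g[0], t - lr * g[1], s - lr * g[2])
                 for (b, t, s), g in zip(params, grads)]
        ok = all(s > 1e-3 for _, _, s in trial)    # keep the widths positive
        trial_risk = risk(trial, data) if ok else float("inf")
        if trial_risk < current:
            params, current = trial, trial_risk
        else:
            lr *= 0.5
    return params, current

# Synthetic, noise-free data from a single Gaussian bump centered at 1.
data = [(x / 10.0, math.exp(-((x / 10.0 - 1.0) ** 2))) for x in range(21)]
start = [(0.5, 0.5, 1.5)]
fitted, final_risk = descend(start, data)
```

Because a step is accepted only when it lowers the empirical risk, the returned risk is never worse than the starting one; nothing, however, guarantees that it is the global minimum of eq. (2.14).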

2.5.3 Connections with Other Results 

• In the neural network and computational learning theory communities results 
have been obtained pertaining to the issues of generalization and learnability. 
Some theoretical work has been done (Baum and Haussler, 1989; Haussler, 1989; 
Ji and Psaltis, 1992) in characterizing the sample complexity of finite sized 
networks. Of these, it is worthwhile to mention again the work of Haussler from 
which this chapter derives much inspiration. He obtains bounds for a fixed 
hypothesis space, i.e., a fixed finite network architecture. Here we deal with 
families of hypothesis spaces using richer and richer hypothesis spaces as more 
and more data becomes available. Later we will characterize the trade-off 
between hypothesis complexity and error rate. Others (Levin, Tishby, and Solla, 
1990; Opper and Haussler, 1991) attempt to characterize the generalization 
abilities of feed-forward networks using theoretical formalizations from 
statistical mechanics. Yet others (Botros and Atkeson, 1991; Moody, 1992; Cohn and 
Tesauro, 1991; Weigand, Rumelhart, and Huberman, 1991) attempt to obtain 
empirical bounds on generalization abilities. 

• This is an attempt to obtain rate-of-convergence bounds in the spirit of Barron's
work, but using a different approach. We have chosen to combine theorems from
approximation theory (which gives us the $O(1/n)$ term in the rate) and uniform
convergence theory (which gives us the other part). Note that at this moment,
our rate of convergence is worse than Barron's. In particular, he obtains a
rate of convergence of $O(1/n + (nk \ln l)/l)$. Further, he has a different set of
assumptions on the class of functions (corresponding to our $\mathcal{F}$). Finally, the
approximation scheme is a class of networks with sigmoidal units as opposed to
radial-basis units, and a different proof technique is used. It should be mentioned
here that his proof relies on a discretization of the networks into a countable
family, while no such assumption is made here.

• It would be worthwhile to make a reference to (Geman, Bienenstock, and Doursat,
1992), which discusses the bias-variance dilemma. This is another way of
formulating the trade-off between the approximation error and the estimation 
error. As the number of parameters (proportional to n) increases, the bias 
(which can be thought of as analogous to the approximation error) of the esti- 
mator decreases and its variance (which can be thought of as analogous to the 
estimation error) increases for a fixed size of the data set. Finding the right 
bias-variance trade-off is very similar in spirit to finding the trade-off between 
network complexity and data complexity. 

• Given the class of radial basis functions we are using, a natural comparison 
arises with kernel regression (Krzyzak, 1986; Devroye, 1981) and results on the 
convergence of kernel estimators. It should be pointed out that, unlike our 
scheme, Gaussian-kernel regressors require the variance of the Gaussian to go 
to zero as a function of the data. Further the number of kernels is always equal 
to the number of data points and the issue of trade-off between the two is not 
explored to the same degree. 

• In our statement of the problem, we discussed how pattern classification could be 
treated as a special case of regression. In this case the function f corresponds to 
the Bayes a-posteriori decision function. Researchers (Richard, and Lippman, 
1991; Hampshire, and Pearlmutter, 1990; Gish, 1990) in the neural network 
community have observed that a network trained on a least square error criterion 
and used for pattern classification was in effect computing the Bayes decision 
function. This chapter provides a rigorous proof of the conditions under which 
this is the case. 

2.6 Implications of the Theorem in Practice: Putting 
In the Numbers 

We have stated our main result in a particular form. We have provided a provable
upper bound on the error (in the $\|\cdot\|_{L^2(P)}$ metric) in terms of the number of examples
and the number of basis functions used. Further, we have provided the order of the
convergence and have not stated the constants involved. The same result could be
stated in other forms and has certain implications. It provides us with rates at which
the number of basis functions ($n$) should increase as a function of the number of
examples ($l$) in order to guarantee convergence (Section 2.6.1). It also provides us
with the trade-offs between the two, as explored in Section 2.6.2.

2.6.1 Rate of Growth of n for Guaranteed Convergence

From our theorem (2.4.1) we see that the generalization error converges to zero only
if $n$ goes to infinity more slowly than $l$. In fact, if $n$ grows too quickly the estimation
error $\omega(l, n, \delta)$ will diverge, because it grows with $n$. Setting $n = l^r$, we
obtain

    $\lim_{l \to +\infty} \omega(l, n, \delta) = \lim_{l \to +\infty} O\left( \left[ \frac{l^r k \ln(l^{r+1}) - \ln \delta}{l} \right]^{1/2} \right) = \lim_{l \to +\infty} l^{r-1} \ln l .$

Therefore the condition $r < 1$ should hold in order to guarantee convergence to zero.
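
The role of the condition r < 1 can be checked numerically. The following is only an illustrative sketch: it evaluates the estimation term of the bound with all constants set to 1, and the function name `estimation_bound` as well as the default values k = 2 and δ = 0.05 are assumptions made for this example, not quantities fixed by the theorem.

```python
import math


def estimation_bound(l: int, r: float, k: int = 2, delta: float = 0.05) -> float:
    """Estimation term sqrt((n k ln(n l) - ln(delta)) / l) with n = l**r.

    All multiplicative constants are set to 1 (an assumption of this sketch).
    """
    n = l ** r
    return math.sqrt((n * k * math.log(n * l) - math.log(delta)) / l)


# Tabulate the term for a few growth exponents r and sample sizes l.
for r in (0.25, 0.5, 1.0):
    values = [estimation_bound(l, r) for l in (10**2, 10**4, 10**6)]
    print(r, [round(v, 3) for v in values])
```

Under these assumptions the tabulated values shrink as l grows when r < 1 and grow when r = 1, in line with the limit above.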

2.6.2 Optimal Choice of n 

In the previous section we made the point that the number of parameters $n$ should
grow more slowly than the number of data points $l$, in order to guarantee the consistency
of the estimator $\hat{f}_{n,l}$. It is quite clear that there is an optimal rate of growth of
the number of parameters that, for any fixed number of data points $l$, gives the best
possible performance with the least number of parameters. In other words, for any
fixed $l$ there is an optimal number of parameters $n^*(l)$ that minimizes the generalization
error. That such a number should exist is quite intuitive: for a fixed number
of data, a small number of parameters will give a low estimation error $\omega(l, n, \delta)$, but
a very high approximation error $\epsilon(n)$, and therefore the generalization error will be
high. If the number of parameters is very high the approximation error $\epsilon(n)$ will be
very small, but the estimation error $\omega(l, n, \delta)$ will be high, leading to a large generalization
error again. Therefore, somewhere in between there should be a number of
parameters high enough to make the approximation error small, but not too high, so
that these parameters can be estimated reliably, with a small estimation error. This
phenomenon is evident from figure (2-7), where we plotted the generalization error as
a function of the number of parameters $n$ for various choices of sample size $l$. Notice
that for a fixed sample size, the error passes through a minimum, and that the
location of the minimum shifts to the right when the sample size is increased.

Figure 2-7: Bound on the generalization error as a function of the number of basis
functions $n$ keeping the sample size $l$ fixed. This has been plotted for a few different
choices of sample size. Notice how the generalization error goes through a minimum
for a certain value of $n$. This would be an appropriate choice for the given (constant)
data complexity. Note also that the minimum is broader for larger $l$, that is, an
accurate choice of $n$ is less critical when plenty of data is available.

In order to find out exactly what the optimal rate of growth of the network size is,
we simply find the minimum of the generalization error as a function of $n$ keeping the
sample size $l$ fixed. Therefore we have to solve the equation:

    $\frac{\partial}{\partial n} \|f_0 - \hat{f}_{n,l}\|^2_{L^2(P)} = 0$

for $n$ as a function of $l$. Substituting the bound given in theorem (2.4.1) in the
previous equation, and setting all the constants to 1 for simplicity, we obtain:

    $\frac{\partial}{\partial n} \left[ \frac{1}{n} + \left( \frac{nk \ln(nl) - \ln \delta}{l} \right)^{1/2} \right] = 0 .$



Performing the derivative, the expression above can be written as

    $\frac{1}{n^2} = \frac{1}{2} \left[ \frac{kn \ln(nl) - \ln \delta}{l} \right]^{-1/2} \frac{k}{l} \left[ \ln(nl) + 1 \right] .$



We now make the assumption that $l$ is big enough to let us perform the approximation
$\ln(nl) + 1 \approx \ln(nl)$. Moreover, we assume that

    $\frac{1}{\delta} \ll (nl)^{nk}$

in such a way that the term including $\delta$ in the equation above is negligible. After some
algebra we therefore conclude that the optimal number of parameters $n^*(l)$ satisfies,
for large $l$, the equation:

    $n^*(l) = \left[ \frac{4l}{k \ln(n^*(l)\, l)} \right]^{1/3} .$

From this equation it is clear that $n^*$ is roughly proportional to a power of $l$, and
therefore we can neglect the factor $n^*$ in the denominator of the previous equation,
since it will only affect the result by a multiplicative constant. Therefore we conclude
that the optimal number of parameters $n^*(l)$ for a given number of examples behaves
as

    $n^*(l) \propto \left[ \frac{l}{k \ln l} \right]^{1/3} . \qquad (2.15)$
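
The rate in eq. (2.15) can be compared with a brute-force minimization of the bound. The sketch below is illustrative only: it uses the bound with all constants set to 1, and the helper names `bound`, `best_n`, and `predicted_n`, together with the defaults k = 5 and δ = 0.05, are assumptions introduced for the example.

```python
import math


def bound(n: int, l: int, k: int, delta: float) -> float:
    """Approximation term 1/n plus estimation term, all constants set to 1."""
    return 1.0 / n + math.sqrt((n * k * math.log(n * l) - math.log(delta)) / l)


def best_n(l: int, k: int = 5, delta: float = 0.05) -> int:
    """Network size in 1..l that minimizes the bound (exhaustive search)."""
    return min(range(1, l + 1), key=lambda n: bound(n, l, k, delta))


def predicted_n(l: int, k: int = 5) -> float:
    """Rate (l / (k ln l))**(1/3) of eq. (2.15), up to a multiplicative constant."""
    return (l / (k * math.log(l))) ** (1.0 / 3.0)


# Compare the numerical minimizer with the predicted rate for growing l.
for l in (10**3, 10**4, 10**5):
    print(l, best_n(l), round(predicted_n(l), 2))
```

Since eq. (2.15) holds only up to a multiplicative constant, the two columns are expected to grow together rather than to coincide.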



In order to show that this is indeed the optimal rate of growth, we report in figure
(2-8) the generalization error as a function of the number of examples $l$ for different
rates of growth of $n$, that is, setting $n = l^r$ for different values of $r$. Notice that the
exponent $r = 1/3$, which is very similar to the optimal rate of eq. (2.15), performs better
than the larger and smaller exponents shown.

While a fixed sample size suggests the scheme above for choosing an optimal network
size, it is important to note that for a certain confidence rate ($\delta$) and for a fixed error
rate ($\epsilon$), there are various choices of $n$ and $l$ which are satisfactory. Fig. 2-9 shows n

Figure 2-8: The bound on the generalization error as a function of the number of
examples $l$ for different choices of the rate at which network size $n$ increases with
sample size $l$ (horizontal axis: $l$, the number of examples; one curve per growth rate).
Notice that if $n = l$, then the estimator is not guaranteed to converge,
i.e., the bound on the generalization error diverges. While this is a distribution-free
upper bound, we need distribution-free lower bounds as well to make the stronger
claim that $n = l$ will never converge.

Figure 2-10: The generalization error as a function of number of examples keeping the
number of basis functions ($n$) fixed. This has been done for several choices of $n$. As
the number of examples increases to infinity the generalization error asymptotes to
a minimum which is not the Bayes error rate because of finite hypothesis complexity
(finite $n$).

Figure 2-11: The generalization error, the number of examples ($l$) and the number of
basis functions ($n$) as a function of each other.



2.7 Conclusion 

For the task of learning some unknown function from labelled examples where we 
have multiple hypothesis classes of varying complexity, choosing the class of right 
complexity and the appropriate hypothesis within that class poses an interesting 
problem. We have provided an analysis of the situation and the issues involved and in 
particular have tried to show how the hypothesis complexity, the sample complexity 
and the generalization error are related. We proved a theorem for a special set of 
hypothesis classes, the radial basis function networks and we bound the generalization 
error for certain function learning tasks in terms of the number of parameters and 
the number of examples. This is equivalent to obtaining a bound on the rate at 
which the number of parameters must grow with respect to the number of examples 
for convergence to take place. Thus we use richer and richer hypothesis spaces as 
more and more data become available. We also see that there is a tradeoff between 
hypothesis complexity and generalization error for a certain fixed amount of data and 
our result allows us a principled way of choosing an appropriate hypothesis complexity 
(network architecture). The choice of an appropriate model for empirical data is a
problem of long-standing interest in statistics and we provide connections between
our work and other work in the field.

2-A Notations 

• $\mathcal{A}$: a set of functions defined on $S$ such that, for any $a \in \mathcal{A}$,

    $0 \le a(\xi) \le U^2 \quad \forall \xi \in S$.

• $\mathcal{A}_{\bar{\xi}}$: the restriction of $\mathcal{A}$ to the data set, see eq. (2.23).

• $\mathcal{B}$: it will usually indicate the set of all possible $l$-dimensional Boolean vectors.

• $B$: a generic $\epsilon$-separated set in $S$.

• $\mathcal{C}(\epsilon, \mathcal{A}, d_{L^1})$: the metric capacity of a set $\mathcal{A}$ endowed with the metric $d_{L^1(P)}$.

• $d(\cdot, \cdot)$: a metric on a generic metric space $S$.

• $d_{L^1}(\cdot, \cdot)$, $d_{L^1(P)}(\cdot, \cdot)$: $L^1$ metrics in vector spaces. The definition depends on
the space on which the metric is defined ($k$-dimensional vectors, real valued
functions, vector valued functions).

  1. In a vector space $R^k$ we have

       $d_{L^1}(x, y) = \frac{1}{k} \sum_{\mu=1}^{k} |x^\mu - y^\mu|$

     where $x, y \in R^k$, and $x^\mu$ and $y^\mu$ denote their $\mu$-th components.

  2. In an infinite dimensional space $\mathcal{F}$ of real valued functions in $k$ variables
     we have

       $d_{L^1(P)}(f, g) = \int_{R^k} |f(x) - g(x)| \, dP(x)$

     where $f, g \in \mathcal{F}$ and $dP(x)$ is a probability measure on $R^k$.

  3. In an infinite dimensional space $\mathcal{F}$ of functions in $k$ variables with values
     in $R^n$ we have

       $d_{L^1(P)}(f, g) = \frac{1}{n} \sum_{i=1}^{n} \int_{R^k} |f_i(x) - g_i(x)| \, dP(x)$

     where $f(x) = (f_1(x), \ldots, f_i(x), \ldots, f_n(x))$ and $g(x) = (g_1(x), \ldots, g_i(x), \ldots, g_n(x))$
     are elements of $\mathcal{F}$ and $dP(x)$ is a probability measure on $R^k$.

• $D_l$: it will always indicate a data set of $l$ points:

    $D_l = \{(x_i, y_i) \in X \times Y\}_{i=1}^{l}$.

  The points are drawn according to the probability distribution $P(x, y)$.

• $E[\cdot]$: it denotes the expected value with respect to the probability distribution
$P(x, y)$. For example

    $I[f] = E[(y - f(x))^2]$,

and

    $\|f_0 - f\|^2_{L^2(P)} = E[(f_0(x) - f(x))^2]$.

• $f$: a generic estimator, that is, any function from $X$ to $Y$:

    $f : X \to Y$.

• $f_0(x)$: the regression function; it is the conditional mean of the response given
the predictor:

    $f_0(x) = \int dy \, y \, P(y|x)$.

It can also be defined as the function that minimizes the expected risk $I[f]$ in
$\mathcal{U}$, that is

    $f_0(x) = \arg\inf_{f \in \mathcal{U}} I[f]$.

Whenever the response is obtained by sampling a function $h$ in the presence of zero
mean noise, the regression function coincides with the sampled function $h$.

• $f_n$: it is the function that minimizes the expected risk $I[f]$ in $H_n$:

    $f_n = \arg\inf_{f \in H_n} I[f]$.

Since

    $I[f] = \|f_0 - f\|^2_{L^2(P)} + I[f_0]$,

$f_n$ is also the best $L^2(P)$ approximation to the regression function in $H_n$ (see
figure 2-6).

• $\hat{f}_{n,l}$: it is the function that minimizes the empirical risk $I_{emp}[f]$ in $H_n$:

    $\hat{f}_{n,l} = \arg\inf_{f \in H_n} I_{emp}[f]$.

In the neural network language it is the output of the network after training
has occurred.

• $\mathcal{F}$: the space of functions to which the regression function belongs, that is, the
space of functions we want to approximate. Its elements map $X$ to $Y$,
where $X \subseteq R^d$ and $Y \subseteq R$. $\mathcal{F}$ could be for example a set of differentiable
functions, or some Sobolev space $H^{m,p}(R^k)$.

• $\mathcal{G}$: it is a class of functions of $k$ variables

    $g : R^k \to [0, V]$

defined as

    $\mathcal{G} = \{g : g(x) = G(\|x - t\|), \; t \in R^k\}$,

where $G$ is the Gaussian function.

• $\mathcal{G}_1$: it is a $(k+2)$-dimensional vector space of functions from $R^k$ to $R$ defined as

    $\mathcal{G}_1 = \mathrm{span}\{1, x^1, x^2, \ldots, x^k, \|x\|^2\}$,

where $x \in R^k$ and $x^\mu$ is the $\mu$-th component of the vector $x$.

• $\mathcal{G}_2$: it is a set of real valued functions in $k$ variables defined as

    $\mathcal{G}_2 = \{a e^{-f} : f \in \mathcal{G}_1, \; a = \frac{1}{\sqrt{2\pi}\,\sigma}\}$,

where $\sigma$ is the standard deviation of the Gaussian $G$.

• $H_I$: it is a class of vector valued functions

    $g(x) : R^k \to R^n$

of the form

    $g(x) = (G(\|x - t_1\|), G(\|x - t_2\|), \ldots, G(\|x - t_n\|))$,

where $G$ is the Gaussian function and the $t_i$ are arbitrary $k$-dimensional vectors.

• $H_F$: it is a class of real valued functions in $n$ variables:

    $f : [0, V]^n \to R$

of the form

    $f(x) = \beta \cdot x$,

where $\beta = (\beta_1, \ldots, \beta_n)$ is an arbitrary $n$-dimensional vector that satisfies the
constraint

    $\sum_{i=1}^{n} |\beta_i| \le M$.

• $H_n$: a subset of $\mathcal{F}$, whose elements are parametrized by a number of parameters
proportional to $n$. We will assume that the sets $H_n$ form a nested family, that
is

    $H_1 \subset H_2 \subset \ldots \subset H_n \subset \ldots$.

For example $H_n$ could be the set of polynomials in one variable of degree $n - 1$,
Radial Basis Functions with $n$ centers, or multilayer perceptrons with $n$ hidden
units. Notice that for Radial Basis Functions with moving centers and multilayer
perceptrons the number of parameters of an element of $H_n$ is not $n$, but it
is proportional to $n$ (respectively $n(k + 1)$ and $n(k + 2)$, where $k$ is the number
of variables).

• $H$: it is defined as $H = \bigcup_{n=1}^{\infty} H_n$, and it is identified with the approximation
scheme. If $H_n$ is the set of polynomials in one variable of degree $n - 1$, $H$ is the
set of polynomials of any degree.

• $H^{m,p}(R^k)$: the Sobolev space of functions in $k$ variables whose derivatives up to
order $m$ are in $L^p(R^k)$.

• $I[f]$: the expected risk, defined as

    $I[f] = \int_{X \times Y} dx \, dy \, P(x, y) (y - f(x))^2$,

where $f$ is any function for which this expression is well defined. It is a measure
of how well the function $f$ predicts the response $y$.

• $I_{emp}[f]$: the empirical risk. It is a functional on $\mathcal{U}$ defined as

    $I_{emp}[f] = \frac{1}{l} \sum_{i=1}^{l} (y_i - f(x_i))^2$,

where $\{(x_i, y_i)\}_{i=1}^{l}$ is a set of data randomly drawn from $X \times Y$ according to the
probability distribution $P(x, y)$. It is an approximate measure of the expected
risk, since it converges to $I[f]$ in probability when the number of data points $l$
tends to infinity.

• k: it will always indicate the number of independent variables, and therefore 
the dimensionality of the set X. 

• $l$: it will always indicate the number of data points drawn from $X$ according to
the probability distribution $P(x)$.

• $L^2(P)$: the set of functions whose square is integrable with respect to the measure
defined by the probability distribution $P$. The norm in $L^2(P)$ is therefore
defined by

    $\|f\|^2_{L^2(P)} = \int_{R^k} dx \, P(x) f^2(x)$.

• $\Lambda^m(R^k)(M_0, M_1, M_2, \ldots, M_m)$: the space of functions in $k$ variables whose
derivatives up to order $m$ are bounded:

    $|D^\alpha f| \le M_{|\alpha|}, \quad |\alpha| = 1, 2, \ldots, m$,

where $\alpha$ is a multi-index.

• M: a bound on the coefficients of the gaussian Radial Basis Functions technique 
considered in this paper, see eq. (2.13). 

• $\mathcal{M}(\epsilon, \mathcal{S}, d)$: the packing number of the set $\mathcal{S}$, with metric $d$.

• $\mathcal{N}(\epsilon, \mathcal{S}, d)$: the covering number of the set $\mathcal{S}$, with metric $d$.

• $n$: a positive number proportional to the number of parameters of the approximating
function. Usually it will be the number of basis functions for the RBF
technique or the number of hidden units for a multilayer perceptron.

• $P(x)$: a probability distribution defined on $X$. It is the probability distribution
according to which the data are drawn from $X$.

• $P(y|x)$: the conditional probability of the response $y$ given the predictor $x$. It
represents the probabilistic dependence of $y$ on $x$. If there is no noise in the
system it has the form $P(y|x) = \delta(y - h(x))$, for some function $h$, indicating
that the predictor $x$ uniquely determines the response $y$.

• $P(x, y)$: the joint distribution of the predictors and the response. It is a probability
distribution on $X \times Y$ and has the form

    $P(x, y) = P(x) P(y|x)$.

• $S$: it will usually denote a metric space, endowed with a metric $d$.

• $\mathcal{S}$: a generic subset of a metric space $S$.

• $\mathcal{T}$: a generic $\epsilon$-cover of a subset $\mathcal{S} \subseteq S$.

• $U$: it gives a bound on the elements of the class $\mathcal{A}$. In the specific case of the
class $\mathcal{A}$ considered in the proof we have $U = 1 + MV$.

• $\mathcal{U}$: the set of all the functions from $X$ to $Y$ for which the expected risk is well
defined.

• $V$: a bound on the Gaussian basis function $G$:

    $0 \le G(x) \le V, \quad \forall x \in R^k$.

• $X$: a subset of $R^k$, not necessarily proper. It is the set of the independent
variables, or predictors, or, in the language of neural networks, input variables.

• $x$: a generic element of $X$, and therefore a $k$-dimensional vector (in the neural
network language it is the input vector).

• $Y$: a subset of $R$, whose elements represent the response variable, which in the
neural network language is the output of the network. Unless otherwise stated
it will be assumed to be compact, implying that $\mathcal{F}$ is a set of bounded functions.
In pattern recognition problems it is simply the set $\{0, 1\}$.

• $y$: a generic element of $Y$; it denotes the response variable.

2-B A Useful Decomposition of the Expected Risk 

We now show that the function that minimizes the expected risk

    $I[f] = \int_{X \times Y} dx \, dy \, P(x, y) (y - f(x))^2$

is the regression function defined in eq. (2.3). It is sufficient to add and subtract the
regression function in the definition of expected risk:

    $I[f] = \int_{X \times Y} dx \, dy \, P(x, y) (y - f_0(x) + f_0(x) - f(x))^2 =$

    $= \int_{X \times Y} dx \, dy \, P(x, y) (y - f_0(x))^2 +$

    $+ \int_{X \times Y} dx \, dy \, P(x, y) (f_0(x) - f(x))^2 +$

    $+ 2 \int_{X \times Y} dx \, dy \, P(x, y) (y - f_0(x)) (f_0(x) - f(x))$.

By definition of the regression function $f_0(x)$, the cross product in the last equation
is easily seen to be zero, and therefore

    $I[f] = \int_X dx \, P(x) (f_0(x) - f(x))^2 + I[f_0]$.

Since the last term of $I[f]$ does not depend on $f$, the minimum is achieved when the
first term is minimum, that is when $f(x) = f_0(x)$.
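
The vanishing of the cross product can be written out explicitly. The following step is a sketch that uses only the factorization $P(x, y) = P(x) P(y|x)$, the normalization of $P(y|x)$, and the definition of $f_0$ as the conditional mean of $y$ given $x$:

```latex
\begin{align*}
\int_{X \times Y} dx\, dy\, P(x, y)\, (y - f_0(x))\, (f_0(x) - f(x))
  &= \int_X dx\, P(x)\, (f_0(x) - f(x)) \int_Y dy\, P(y|x)\, (y - f_0(x)) \\
  &= \int_X dx\, P(x)\, (f_0(x) - f(x))
     \Big[ \int_Y dy\, y\, P(y|x) - f_0(x) \Big] = 0 .
\end{align*}
```

The bracket is zero for every $x$ because $f_0(x) = \int dy \, y \, P(y|x)$.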

In the case in which the data come from randomly sampling a function $f$ in the
presence of additive noise $\epsilon$, with probability distribution $\mathcal{P}(\epsilon)$ and zero mean, we
have $P(y|x) = \mathcal{P}(y - f(x))$ and then

    $I[f_0] = \int_{X \times Y} dx \, dy \, P(x, y) (y - f_0(x))^2 =$   (2.16)

    $= \int_X dx \, P(x) \int_Y dy \, (y - f(x))^2 \, \mathcal{P}(y - f(x)) =$   (2.17)

    $= \int_X dx \, P(x) \int \epsilon^2 \, \mathcal{P}(\epsilon) \, d\epsilon = \sigma^2$   (2.18)

where $\sigma^2$ is the variance of the noise. When data are noisy, therefore, even in the most
favourable case we cannot expect the expected risk to be smaller than the variance
of the noise.
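
This noise floor can be illustrated with a small simulation. The sketch below is not part of the derivation: the sampled function `target`, the competitor `shifted`, the uniform input distribution, and the Gaussian noise with σ = 0.5 are all assumptions chosen for the example, and the empirical risk on a large sample is used as a stand-in for the expected risk.

```python
import math
import random


def empirical_risk(f, data):
    """Mean squared error of f on a list of (x, y) pairs."""
    return sum((y - f(x)) ** 2 for x, y in data) / len(data)


def target(x):
    """Hypothetical sampled function (any fixed function would do)."""
    return math.sin(3.0 * x)


def shifted(x):
    """A competitor that differs from the target by a constant offset."""
    return target(x) + 0.3


rng = random.Random(0)
sigma = 0.5  # standard deviation of the zero-mean additive noise
data = []
for _ in range(20000):
    x = rng.uniform(-1.0, 1.0)
    data.append((x, target(x) + rng.gauss(0.0, sigma)))

risk_target = empirical_risk(target, data)    # estimates I[f_0]
risk_shifted = empirical_risk(shifted, data)  # estimates I[f] for f != f_0
print(risk_target, risk_shifted, sigma ** 2)
```

With these choices the risk of the sampled function should sit near σ², while any other function pays an additional term equal to its squared $L^2(P)$ distance from the regression function.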



2-C A Useful Inequality 

Let us assume that, with probability $1 - \delta$, a uniform bound has been established:

    $|I_{emp}[f] - I[f]| \le \omega(l, n, \delta) \quad \forall f \in H_n$.

We want to prove that the following inequality also holds:

    $|I[f_n] - I[\hat{f}_{n,l}]| \le 2\omega(l, n, \delta)$.   (2.19)

This fact is easily established by noting that, since the bound above is uniform,
it holds for both $f_n$ and $\hat{f}_{n,l}$, and therefore the following inequalities hold:

    $I[\hat{f}_{n,l}] \le I_{emp}[\hat{f}_{n,l}] + \omega$
    $I_{emp}[f_n] \le I[f_n] + \omega$

Moreover, by definition, the two following inequalities also hold:

    $I[f_n] \le I[\hat{f}_{n,l}]$
    $I_{emp}[\hat{f}_{n,l}] \le I_{emp}[f_n]$

Therefore the following chain of inequalities holds, proving inequality (2.19):

    $I[f_n] \le I[\hat{f}_{n,l}] \le I_{emp}[\hat{f}_{n,l}] + \omega \le I_{emp}[f_n] + \omega \le I[f_n] + 2\omega$.

An intuitive explanation of these inequalities is given in figure (2-12).

Figure 2-12: If the distance between $I[f_n]$ and $I[\hat{f}_{n,l}]$ is larger than $2\epsilon$, the condition
$I_{emp}[\hat{f}_{n,l}] \le I_{emp}[f_n]$ is violated.



2-D Proof of the Main Theorem 

The theorem will be proved in a series of steps. Conceptually, there are four major 
steps in the proof outlined in the proof structure below. 

Structure of Proof 

Step 1 

The total generalization error is decomposed into its approximation and estimation
components. Using the derivations outlined in appendices 2-B and 2-C, we are able
to show that the decomposition has the form of statement 2.2.1 of section 2.2, viz.,
with probability $1 - \delta$,

    $\|f_0 - \hat{f}_{n,l}\|^2_{L^2(P)} \le \epsilon(n) + 2\omega(l, n, \delta)$.   (2.20)

We now need to compute $\epsilon(n)$ and $\omega(l, n, \delta)$, and these constitute steps 2 and 3 of the
proof structure.
Step 2 

We obtain a bound on $\epsilon(n)$ (the approximation error) in section 2-D.1. The fundamental
lemma used here is the Maurey-Jones-Barron lemma (Lemma 2-D.1), from which the
approximation bound is obtained.
Step 3 

We obtain a bound on the estimation error $\omega(l, n, \delta)$ in section 2-D.2. Recall that we
need to be able to prove a uniform law of large numbers of the form:

    $\forall f \in H_n, \quad |I[f] - I_{emp}[f]| \le \omega(l, n, \delta)$

with probability greater than $1 - \delta$.

Starting with a uniform law of the form stated in Claim 2-D.1 and refining it
further we arrive at Claim 2-D.3. In doing this, we introduce notions of covering
numbers and metric entropy. The form of this refined uniform law of large numbers
is:

    $P(\forall h \in H_n, \; |I_{emp}[h] - I[h]| \le \epsilon) \ge 1 - 4\,\mathcal{C}(\epsilon/16, \mathcal{A}, d_{L^1}) \, e^{-\frac{1}{128 U^4} \epsilon^2 l}$.

In order to let $1 - 4\,\mathcal{C}(\epsilon/16, \mathcal{A}, d_{L^1}) \, e^{-\frac{1}{128 U^4} \epsilon^2 l}$ be greater than $1 - \delta$, we need to obtain
an expression for $\mathcal{C}(\epsilon/16, \mathcal{A}, d_{L^1})$ in terms of the number of parameters. Claims 2-D.4
through 2-D.9 go through this computation.

Finally, in claim 2-D.10, we show how to use this result to compute an expression
for $\omega(l, n, \delta)$, which is what we originally set out to do.
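
To make the role of the metric capacity concrete, one can solve the requirement $4\,\mathcal{C}\, e^{-\epsilon^2 l / (128 U^4)} \le \delta$ for $l$, which gives $l \ge (128 U^4 / \epsilon^2) \ln(4\,\mathcal{C} / \delta)$. The sketch below only rearranges this inequality; the function names and the numerical capacity value used in the example are placeholders (assumptions), since the actual capacity is what Claims 2-D.4 through 2-D.9 compute.

```python
import math


def failure_probability(l: int, eps: float, capacity: float, U: float) -> float:
    """Right-hand side 4 * C * exp(-eps^2 * l / (128 * U^4)) of the uniform law."""
    return 4.0 * capacity * math.exp(-(eps ** 2) * l / (128.0 * U ** 4))


def examples_needed(eps: float, delta: float, capacity: float, U: float) -> int:
    """Smallest integer l with failure_probability(l, ...) <= delta."""
    return math.ceil(128.0 * U ** 4 / eps ** 2 * math.log(4.0 * capacity / delta))


# Hypothetical numbers: the capacity value is a placeholder, not a computed one.
l_min = examples_needed(eps=0.1, delta=0.05, capacity=1e6, U=1.0)
print(l_min, failure_probability(l_min, 0.1, 1e6, 1.0))
```

The logarithmic dependence on the capacity is what allows a capacity that grows with the number of parameters to be absorbed into a bound of the form $\omega(l, n, \delta)$.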
Step 4 

Putting together the approximation and estimation bounds of steps 2 and 3, we
obtain in section 2-D.3 an expression for the total generalization error in the
appropriate form to prove the main theorem.

2-D.1 Bounding the approximation error

In this part we attempt to bound the approximation error. In section 2.3 we assumed
that the class of functions to which the regression function belongs, that is the class
of functions that we want to approximate, is

    $\mathcal{F} = \{f \,|\, f = \lambda * G_m, \; m > k/2, \; |\lambda|_{R^k} \le M\}$,

where $\lambda$ is a signed Radon measure on the Borel sets of $R^k$, $G_m$ is the Bessel-Macdonald
kernel as defined in section 2.3 and $M$ is a positive real number. Our
approximating family is the class:

    $H_n = \{f \in L^2 \,|\, f = \sum_{i=1}^{n} \beta_i G\left(\frac{\|x - t_i\|}{\sigma_i}\right), \; \sum_{i=1}^{n} |\beta_i| \le M, \; t_i \in R^k\}$.

It has been shown in [49, 50] that the class $H_n$ uniformly approximates elements of $\mathcal{F}$,
and that the following bound is valid:

    $E[(f - f_n)^2] \le O\left(\frac{1}{n}\right)$.   (2.21)



This result is based on a lemma by Jones [71] on the convergence rate of an
iterative approximation scheme in Hilbert spaces. A formally similar lemma, brought
to our attention by R. Dudley [38], is due to Maurey and was published by Pisier
[105]. Here we report a version of the lemma due to Barron [8, 9] that contains a
slight refinement of Jones' result:

Lemma 2-D.1 (Maurey-Jones-Barron) If $f$ is in the closure of the convex hull
of a set $\mathcal{G}$ in a Hilbert space $H$ with $\|g\| \le b$ for each $g \in \mathcal{G}$, then for every $n \ge 1$
and for $c > b^2 - \|f\|^2$ there is an $f_n$ in the convex hull of $n$ points in $\mathcal{G}$ such that

    $\|f - f_n\|^2 \le \frac{c}{n}$.
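
The 1/n rate of the lemma can be illustrated in a finite dimensional setting with the sampling argument usually attributed to Maurey: if $f = \sum_j w_j g_j$ is a convex combination and $n$ elements are drawn independently with probabilities $w_j$, their average $f_n$ satisfies $E\|f - f_n\|^2 = (\sum_j w_j \|g_j\|^2 - \|f\|^2)/n$, so some draw must do at least as well. The sketch below is an assumption-laden toy (unit vectors in $R^{10}$, random weights, hypothetical helper names), not the construction used in [49, 50].

```python
import math
import random


def norm_sq(v):
    """Squared Euclidean norm of a vector given as a list."""
    return sum(x * x for x in v)


def convex_combination(points, weights):
    """Weighted sum of points; weights are nonnegative and sum to 1."""
    dim = len(points[0])
    return [sum(w * p[j] for w, p in zip(weights, points)) for j in range(dim)]


def sampled_error(points, weights, f, n, rng):
    """Squared distance between f and the average of n points drawn i.i.d."""
    picks = rng.choices(points, weights=weights, k=n)
    f_n = [sum(p[j] for p in picks) / n for j in range(len(f))]
    return norm_sq([a - b for a, b in zip(f, f_n)])


rng = random.Random(1)
dim, m, n, trials = 10, 50, 20, 200

# Unit-norm points, so the bound b on the norms equals 1.
points = []
for _ in range(m):
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    scale = math.sqrt(norm_sq(v))
    points.append([x / scale for x in v])

raw = [rng.random() for _ in range(m)]
weights = [r / sum(raw) for r in raw]
f = convex_combination(points, weights)

c_over_n = (1.0 - norm_sq(f)) / n  # (b^2 - ||f||^2) / n with b = 1
errors = [sampled_error(points, weights, f, n, rng) for _ in range(trials)]
print(min(errors), sum(errors) / trials, c_over_n)
```

The average error over the trials should be close to $(b^2 - \|f\|^2)/n$, and the best trial falls below it, which is the existence statement of the lemma in this toy setting.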



In order to exploit this result one needs to define suitable classes of functions which
are the closure of the convex hull of some subset $\mathcal{G}$ of a Hilbert space $H$. One way
to approach the problem consists in utilizing the integral representation of functions.
Suppose that the functions in a Hilbert space $H$ can be represented by the integral

    $f(x) = \int_{\mathcal{M}} G_m(x; t) \, d\alpha(t)$   (2.22)

where $\alpha$ is some measure on the parameter set $\mathcal{M}$, and $G_m(x; t)$ is a function of $H$
parametrized by the parameter $t$, whose norm $\|G_m(x; t)\|$ is bounded by the same
number for any value of $t$. In particular, if we let $G_m(x; t)$ be translates of $G_m$ by
$t$, i.e., $G_m(x - t)$, and $\alpha$ be a finite measure, the integral (2.22) can be seen as an
infinite convex combination of translates of $G_m$.

We now make the following two observations. First, it is clear that elements of
$\mathcal{F}$ have an integral representation of the type (2.22) and are members of the Hilbert
space $H$. Second, since $\lambda$ is a finite measure (bounded by $M$), elements of $\mathcal{F}$ are infinite
convex combinations of translates of $G_m$. We now make use of the important fact that
convex combinations of translates of $G_m$ can be represented as convex combinations
of translates and dilates of Gaussians (in other words sets of the form of $H_n$ for some
$n$).

This allows us to define $\mathcal{G}$ of lemma 2-D.1 to be the parametrized set
$\mathcal{G} = \{g \,|\, g(x) = G(\|x - t\| / \sigma)\}$. Clearly, elements of $\mathcal{F}$ lie in the convex hull of $\mathcal{G}$ as defined above and
therefore, applying lemma (2-D.1), one can prove ([49, 50]) that there exist $n$ coefficients
$c_i$, $n$ parameter vectors $t_i$, and $n$ choices for $\sigma_i$ such that

    $\left\| f - \sum_{i=1}^{n} c_i G\left(\frac{\|x - t_i\|}{\sigma_i}\right) \right\|^2 \le O\left(\frac{1}{n}\right)$.

Notice that the bound (2.21), which is similar in spirit to the result of A. Barron
on multilayer perceptrons [8, 10], is interesting because the rate of convergence does
not depend on the dimension $d$ of the input space. This is apparently unusual in
approximation theory, because it is known, from the theory of linear and nonlinear
widths [122, 104, 88, 89, 32, 31, 33, 92], that, if the function that has to be approximated
has $d$ variables and a degree of smoothness $s$, we should not expect to find an
approximation technique whose approximation error goes to zero faster than $O(n^{-s/d})$.
Here "degree of smoothness" is a measure of how constrained the class of functions 
we consider is, for example the number of derivatives that are uniformly bounded, or 
the number of derivatives that are integrable or square integrable. Therefore, from 
classical approximation theory, we expect that, unless certain constraints are imposed 
on the class of functions to be approximated, the rate of convergence will dramatically 
slow down as the number of dimensions increases, showing the phenomenon known 
as "the curse of dimensionality" [13]. 

In the case of the class $\mathcal{F}$ we consider here, the constraint of considering functions
that are convolutions of Radon measures with Gaussians seems to impose on this
class of functions an amount of smoothness that is sufficient to guarantee that the
rate of convergence does not become slower and slower as the dimension increases. A
longer discussion of the "curse of dimensionality" can be found in [50].

We notice also that, since the rate (2.21) is independent of the dimension, the
class $\mathcal{F}$, together with the approximating class $H_n$, defines a class of problems that
are "tractable" even in a high number of dimensions.

2-D.2 Bounding the estimation error 

In this part we attempt to bound the estimation error $|I[f] - I_{emp}[f]|$. In order to do
that we first need to introduce some basic concepts and notations.

Let $\mathcal{S}$ be a subset of a metric space $S$ with metric $d$. We say that an $\epsilon$-cover with
respect to the metric $d$ is a set $\mathcal{T} \subseteq S$ such that for every $s \in \mathcal{S}$, there exists some
$t \in \mathcal{T}$ satisfying $d(s, t) \le \epsilon$. The size of the smallest $\epsilon$-cover is $\mathcal{N}(\epsilon, \mathcal{S}, d)$ and is called
the covering number of $\mathcal{S}$. In other words

    $\mathcal{N}(\epsilon, \mathcal{S}, d) = \min |\mathcal{T}|$,

where $\mathcal{T}$ runs over all the possible $\epsilon$-covers of $\mathcal{S}$ and $|\mathcal{T}|$ denotes the cardinality of $\mathcal{T}$.
A set $B$ belonging to the metric space $S$ is said to be $\epsilon$-separated if for all distinct
$x, y \in B$, $d(x, y) > \epsilon$. We define the packing number $\mathcal{M}(\epsilon, \mathcal{S}, d)$ as the size of the
largest $\epsilon$-separated subset of $\mathcal{S}$. Thus

    $\mathcal{M}(\epsilon, \mathcal{S}, d) = \max_{B \subseteq \mathcal{S}} |B|$,

where $B$ runs over all the $\epsilon$-separated subsets of $\mathcal{S}$. It is easy to show that the covering
number never exceeds the packing number, that is $\mathcal{N}(\epsilon, \mathcal{S}, d) \le \mathcal{M}(\epsilon, \mathcal{S}, d)$.
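
The relation between covering and packing numbers can be seen on a finite point set: a maximal ε-separated subset is automatically an ε-cover, so its size lies between the covering number and the packing number. The sketch below assumes the cover condition is read as $d(s, t) \le \epsilon$ and uses a normalized $L^1$ distance on random points in the unit square; the helper names and the data are hypothetical, and the greedy construction gives a maximal, not necessarily largest, separated set.

```python
import random


def d_l1(x, y):
    """Normalized L1 distance (1/k) * sum |x_mu - y_mu| between k-vectors."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


def greedy_separated(points, eps):
    """Greedily build a maximal eps-separated subset of points."""
    chosen = []
    for p in points:
        if all(d_l1(p, q) > eps for q in chosen):
            chosen.append(p)
    return chosen


def is_separated(subset, eps):
    """True if all distinct pairs in subset are more than eps apart."""
    return all(
        d_l1(subset[i], subset[j]) > eps
        for i in range(len(subset))
        for j in range(i + 1, len(subset))
    )


def is_cover(cover, points, eps):
    """True if every point is within eps of some element of cover."""
    return all(any(d_l1(p, t) <= eps for t in cover) for p in points)


rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(500)]
eps = 0.1
packing = greedy_separated(points, eps)
print(len(packing), is_separated(packing, eps), is_cover(packing, points, eps))
```

Every rejected point was within ε of an already chosen one, which is why the separated set also covers the whole sample.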

Let now $P(\xi)$ be a probability distribution defined on $S$, and $\mathcal{A}$ be a set of real-valued
functions defined on $S$ such that, for any $a \in \mathcal{A}$,

    $0 \le a(\xi) \le U^2 \quad \forall \xi \in S$.

Let also $\bar{\xi} = (\xi_1, \ldots, \xi_l)$ be a sequence of $l$ examples drawn independently from $S$ according
to $P(\xi)$. For any function $a \in \mathcal{A}$ we define the empirical and true expectations
of $a$ as follows:

    $\hat{E}[a] = \frac{1}{l} \sum_{i=1}^{l} a(\xi_i)$

    $E[a] = \int_S d\xi \, P(\xi) a(\xi)$

The difference between the empirical and true expectation can be bounded by the
following inequality, whose proof can be found in [110] and [62], and which will be crucial
in order to prove our main theorem.

Claim 2-D.1 ([110], [62]) Let $\mathcal{A}$ and $\bar{\xi}$ be as defined above. Then, for all $\epsilon > 0$,

    $P(\exists a \in \mathcal{A} : |\hat{E}[a] - E[a]| > \epsilon) \le 4 E[\mathcal{N}(\epsilon/16, \mathcal{A}_{\bar{\xi}}, d_{L^1})] \, e^{-\frac{1}{128 U^4} \epsilon^2 l}$.

In the above result, $\mathcal{A}_{\bar{\xi}}$ is the restriction of $\mathcal{A}$ to the data set, that is:

    $\mathcal{A}_{\bar{\xi}} = \{(a(\xi_1), \ldots, a(\xi_l)) : a \in \mathcal{A}\}$.   (2.23)

The set A^ is a collection of points belonging to the subset [0, U] 1 of the /-dimensional 
euclidean space. Each function a in A is represented by a point in A^ while every 
point in A^ represents all the functions that have the same values at the points 
£i, . . . ,£/. The distance metric dji in the inequality above is the standard L 1 metric 
in R l , that is 

4.(x,y) = ^|/-/| 

where x and y are points in the /-dimensional euclidean space and x^ and y^ are their 
/i-th components respectively. 
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The above inequality is a result in the theory of uniform convergence of empirical 
measures to their underlying probabilities, that has been studied in great detail by 
Pollard and Vapnik, and similar inequalities can be found in the work of Vapnik 
[126, 127, 125], although they usually involve the VC dimension of the set A } rather 
than its covering numbers. 
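The restriction of a function class to a data set and the normalized L1 metric on the resulting points can be made concrete as follows. This is a small sketch under stated assumptions: the sample points and the three functions are invented for illustration, and the helper names `restrict` and `d_l1` do not appear in the original text.

```python
def restrict(a, sample):
    """Return the point (a(xi_1), ..., a(xi_l)) that represents a on the sample."""
    return tuple(a(xi) for xi in sample)


def d_l1(x, y):
    """Normalized L1 distance (1/l) * sum_mu |x^mu - y^mu| between points of R^l."""
    if len(x) != len(y):
        raise ValueError("points must have the same dimension")
    return sum(abs(p - q) for p, q in zip(x, y)) / len(x)


# Hypothetical sample and functions, chosen only for illustration.
sample = [0.0, 0.5, 1.0]
f = lambda s: s * s
g = lambda s: s             # agrees with f at 0 and 1, differs at 0.5
h = lambda s: abs(s) ** 2   # a different formula that agrees with f on the sample

# f and h collapse to the same point of the restricted set.
print(restrict(f, sample) == restrict(h, sample))
print(d_l1(restrict(f, sample), restrict(g, sample)))
```

Functions that agree on the sample are indistinguishable after restriction, which is why covering the restricted set can be much cheaper than covering the whole class.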

Suppose now we choose $S = X \times Y$, where $X$ is an arbitrary subset of $R^k$ and
$Y = [-M, M]$ as in the formulation of our original problem. The generic element of
$S$ will be written as $\xi = (x, y) \in X \times Y$. We now consider the class of functions $\mathcal{A}$
defined as:

$$\mathcal{A} = \{a : X \times Y \to R \mid a(x, y) = (y - h(x))^2, \; h \in H_n(R^k)\}$$

where $H_n(R^k)$ is the class of $k$-dimensional Radial Basis Functions with $n$ basis functions
defined in eq. 2.13 in section 2.3. Clearly,

$$|y - h(x)| \le M + |h(x)| \le M + MV ,$$

and therefore

$$0 \le a \le U^2$$

where we have defined

$$U = M + MV .$$
We notice that, by definition of $\hat{E}(a)$ and $E(a)$, we have

$$\hat{E}(a) = \frac{1}{l} \sum_{i=1}^{l} (y_i - h(x_i))^2 = I_{emp}[h]$$

and

$$E(a) = \int_{X \times Y} dx \, dy \, P(x, y) (y - h(x))^2 = I[h] .$$

Therefore, applying the inequality of claim 2-D.1 to the set $\mathcal{A}$, and noticing that the
elements of $\mathcal{A}$ are essentially defined by the elements of $H_n$, we obtain the following
result:

$$P\left(\forall h \in H_n, \; |I_{emp}[h] - I[h]| \le \epsilon\right) \ge 1 - 4 E\left[\mathcal{N}(\epsilon/16, \mathcal{A}_\xi, d_{L^1})\right] e^{-\frac{1}{128 U^4} \epsilon^2 l} , \qquad (2.24)$$

so that the inequality of claim 2-D.1 gives us a bound on the estimation error. However,
this bound depends on the specific choice of the probability distribution $P(x, y)$,
while we are interested in bounds that do not depend on $P$. Therefore it is useful to
define some quantity that does not depend on $P$, and give bounds in terms of that.
We then introduce the concept of metric capacity of $\mathcal{A}$, which is defined as

$$\mathcal{C}(\epsilon, \mathcal{A}, d_{L^1}) = \sup_P \{\mathcal{N}(\epsilon, \mathcal{A}, d_{L^1(P)})\}$$

where the supremum is taken over all the probability distributions $P$ defined over $S$,
and $d_{L^1(P)}$ is the standard $L^1(P)$ distance (see footnote 12) induced by the probability
distribution $P$:

$$d_{L^1(P)}(a_1, a_2) = \int_S d\xi \, P(\xi) \, |a_1(\xi) - a_2(\xi)| , \qquad a_1, a_2 \in \mathcal{A} .$$

The relationship between the covering number and the metric capacity is shown in
the following

Claim 2-D.2

$$E\left[\mathcal{N}(\epsilon, \mathcal{A}_\xi, d_{L^1})\right] \le \mathcal{C}(\epsilon, \mathcal{A}, d_{L^1}) .$$

Proof: For any sequence of points $\xi$ in $S$, there is a trivial isometry between $(\mathcal{A}_\xi, d_{L^1})$
and $(\mathcal{A}, d_{L^1(P_\xi)})$ where $P_\xi$ is the empirical distribution on the space $S$ given by
$\frac{1}{l} \sum_{i=1}^{l} \delta(\xi - \xi_i)$. Here $\delta$ is the Dirac delta function, $\xi \in S$, and $\xi_i$ is the $i$-th
element of the data set. To see that this isometry exists, first note that for every
element $a \in \mathcal{A}$, there exists a unique point $(a(\xi_1), \ldots, a(\xi_l)) \in \mathcal{A}_\xi$. Thus a simple
bijective mapping exists between the two spaces. Now consider any two elements $g$
and $h$ of $\mathcal{A}$. The distance between them is given by

$$d_{L^1(P_\xi)}(g, h) = \int_S |g(\xi) - h(\xi)| \, P_\xi(\xi) \, d\xi = \frac{1}{l} \sum_{i=1}^{l} |g(\xi_i) - h(\xi_i)| .$$

This is exactly the distance between the two points $(g(\xi_1), \ldots, g(\xi_l))$ and $(h(\xi_1), \ldots, h(\xi_l))$,
which are elements of $\mathcal{A}_\xi$, according to the $d_{L^1}$ distance. Thus there is a one-to-one
correspondence between elements of $\mathcal{A}$ and $\mathcal{A}_\xi$, and the distance between two elements
in $\mathcal{A}$ is the same as the distance between their corresponding points in $\mathcal{A}_\xi$. Given
this isometry, for every $\epsilon$-cover in $\mathcal{A}$, there exists an $\epsilon$-cover of the same size in $\mathcal{A}_\xi$,
so that

$$\mathcal{N}(\epsilon, \mathcal{A}_\xi, d_{L^1}) = \mathcal{N}(\epsilon, \mathcal{A}, d_{L^1(P_\xi)}) \le \mathcal{C}(\epsilon, \mathcal{A}, d_{L^1}) ,$$

and consequently $E[\mathcal{N}(\epsilon, \mathcal{A}_\xi, d_{L^1})] \le \mathcal{C}(\epsilon, \mathcal{A}, d_{L^1})$. □

12 Note that here $\mathcal{A}$ is a class of real-valued functions defined on a general metric space $S$. If we
consider an arbitrary $\mathcal{A}$ defined on $S$ and taking values in $R^n$, the $d_{L^1(P)}$ norm is appropriately
adjusted to be

$$d_{L^1(P)}(f, g) = \frac{1}{n} \sum_{i=1}^{n} \int_S |f_i(x) - g_i(x)| \, P(x) \, dx$$

where $f(x) = (f_1(x), \ldots, f_i(x), \ldots, f_n(x))$ and $g(x) = (g_1(x), \ldots, g_i(x), \ldots, g_n(x))$ are elements of $\mathcal{A}$ and
$P(x)$ is a probability distribution on $S$. Thus $d_{L^1}$ and $d_{L^1(P)}$ should be interpreted according to the
context. 

The result above, together with eq. (2.24), shows that the following proposition holds:

Claim 2-D.3

$$P\left(\forall h \in H_n, \; |I_{emp}[h] - I[h]| \le \epsilon\right) \ge 1 - 4 \, \mathcal{C}(\epsilon/16, \mathcal{A}, d_{L^1}) \, e^{-\frac{1}{128 U^4} \epsilon^2 l} . \qquad (2.25)$$

Thus in order to obtain a uniform bound $\omega$ on $|I_{emp}[h] - I[h]|$, our task is reduced to
computing the metric capacity of the functional class $\mathcal{A}$ which we have just defined.
We will do this in several steps. In Claim 2-D.4, we first relate the metric capacity of
$\mathcal{A}$ to that of the class of radial basis functions $H_n$. Then Claims 2-D.5 through 2-D.9
go through a computation of the metric capacity of $H_n$. 

Claim 2-D.4

$$\mathcal{C}(\epsilon, \mathcal{A}, d_{L^1}) \le \mathcal{C}(\epsilon/4U, H_n, d_{L^1})$$

Proof: Fix a distribution $P$ on $S = X \times Y$. Let $P_X$ be the marginal distribution with
respect to $X$. Suppose $K$ is an $\epsilon/4U$-cover for $H_n$ with respect to this probability
distribution $P_X$, i.e. with respect to the distance metric $d_{L^1(P_X)}$ on $H_n$. Further let
the size of $K$ be $\mathcal{N}(\epsilon/4U, H_n, d_{L^1(P_X)})$. This means that for any $h \in H_n$, there exists
a function $h^*$ belonging to $K$, such that:

$$\int |h(x) - h^*(x)| \, P_X(x) \, dx \le \epsilon/4U .$$

Now we claim the set $H(K) = \{(y - h(x))^2 : h \in K\}$ is an $\epsilon$-cover for $\mathcal{A}$ with respect
to the distance metric $d_{L^1(P)}$. To see this, it is sufficient to show that

$$\int |(y - h(x))^2 - (y - h^*(x))^2| \, P(x, y) \, dx \, dy \le$$

$$\le \int |2y - h - h^*| \, |h - h^*| \, P(x, y) \, dx \, dy \le$$

$$\le \int 2(2M + 2MV) \, |h - h^*| \, P(x, y) \, dx \, dy \le \epsilon ,$$

which is clearly true. Now

$$\mathcal{N}(\epsilon, \mathcal{A}, d_{L^1(P)}) \le |H(K)| = \mathcal{N}(\epsilon/4U, H_n, d_{L^1(P_X)}) \le \mathcal{C}(\epsilon/4U, H_n, d_{L^1}) .$$

Taking the supremum over all probability distributions, the result follows. □ 

So the problem reduces to finding $\mathcal{C}(\epsilon, H_n, d_{L^1})$, i.e. the metric capacity of the class
of appropriately defined Radial Basis Functions networks with $n$ centers. To do this
we will decompose the class $H_n$ into the composition of two classes defined as follows.

Definitions/Notations

$H_I$ is a class of functions defined from the metric space $(R^k, d_{L^1})$ to the metric space
$(R^n, d_{L^1})$. In particular,

$$H_I = \left\{ g(x) = \left( G\left(\tfrac{\|x - t_1\|}{\sigma_1}\right), G\left(\tfrac{\|x - t_2\|}{\sigma_2}\right), \ldots, G\left(\tfrac{\|x - t_n\|}{\sigma_n}\right) \right) \right\}$$

where $G$ is a Gaussian and the $t_i$ are $k$-dimensional vectors.

Note here that $G$ is the same Gaussian that we have been using to build our
Radial-Basis-Function Network. Thus $H_I$ is parametrized by the $n$ centers $t_i$ and the $n$
variances of the Gaussians $\sigma_i^2$, in other words $n(k + 1)$ parameters in all.

$H_F$ is a class defined from the metric space $([0, V]^n, d_{L^1})$ to the metric space
$(R, d_{L^1})$. In particular,

$$H_F = \left\{ h(x) = \beta \cdot x, \; x \in [0, V]^n \text{ and } \sum_{i=1}^{n} |\beta_i| \le M \right\}$$

where $\beta = (\beta_1, \ldots, \beta_n)$ is an arbitrary $n$-dimensional vector.
Thus we see that

$$H_n = \{h_F \circ h_I : h_F \in H_F \text{ and } h_I \in H_I\}$$

where $\circ$ stands for the composition operation, i.e., for any two functions $f$ and $g$,
$(f \circ g)(x) = f(g(x))$. It should be pointed out that $H_n$ as defined above is defined from
$R^k$ to $R$. 
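The decomposition of the network class into an inner Gaussian map followed by an outer linear map can be sketched in a few lines. This is an illustration only: the Gaussian is assumed to have the form G(r) = V exp(-r^2), and the function names and the numerical values of the centers, widths and weights are hypothetical.

```python
import math


def h_I(x, centers, sigmas, V=1.0):
    """Inner map from R^k into [0, V]^n: one Gaussian response per center.

    Assumes G(r) = V * exp(-r**2) evaluated at r = ||x - t|| / sigma.
    """
    out = []
    for t, s in zip(centers, sigmas):
        r2 = sum((xi - ti) ** 2 for xi, ti in zip(x, t))
        out.append(V * math.exp(-r2 / (s * s)))
    return out


def h_F(z, beta):
    """Outer map from [0, V]^n into R: the inner product beta . z."""
    return sum(b * zi for b, zi in zip(beta, z))


def rbf_network(x, centers, sigmas, beta, V=1.0):
    """An element of H_n, written as the composition h_F(h_I(x))."""
    return h_F(h_I(x, centers, sigmas, V), beta)


# Hypothetical network with n = 2 centers in R^2 and sum |beta_i| = 2 = M.
centers = [(0.0, 0.0), (1.0, 1.0)]
sigmas = [1.0, 0.5]
beta = [0.5, -1.5]
print(rbf_network((0.0, 0.0), centers, sigmas, beta))
```

Because each Gaussian response lies in [0, V] and the weights satisfy the L1 constraint, the output magnitude never exceeds MV, which is the bound used for U above.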

Claim 2-D.5

$$\mathcal{C}(\epsilon, H_I, d_{L^1}) \le 2^n \left( \frac{2eV}{\epsilon} \ln \frac{2eV}{\epsilon} \right)^{n(k+2)}$$

Proof: Fix a probability distribution $P$ on $R^k$. Consider the class

$$\mathcal{G} = \left\{ g : g(x) = G\left(\tfrac{\|x - t\|}{\sigma}\right), \; t \in R^k, \; \sigma \in R \right\} .$$

Let $K$ be an $\mathcal{N}(\epsilon, \mathcal{G}, d_{L^1(P)})$-sized $\epsilon$-cover for this class. We first claim that

$$T = \{(h_1, \ldots, h_n) : h_i \in K\}$$

is an $\epsilon$-cover for $H_I$ with respect to the $d_{L^1(P)}$ metric.

Remember that the $d_{L^1(P)}$ distance between two vector-valued functions $g(x) =
(g_1(x), \ldots, g_n(x))$ and $g^*(x) = (g_1^*(x), \ldots, g_n^*(x))$ is defined as

$$d_{L^1(P)}(g, g^*) = \frac{1}{n} \sum_{i=1}^{n} \int |g_i(x) - g_i^*(x)| \, P(x) \, dx .$$

To see that $T$ is a cover, pick an arbitrary $g = (g_1, \ldots, g_n) \in H_I$. For each $g_i$, there exists a $g_i^* \in K$
which is $\epsilon$-close in the appropriate sense for real-valued functions, i.e. $d_{L^1(P)}(g_i, g_i^*) \le
\epsilon$. The function $g^* = (g_1^*, \ldots, g_n^*)$ is an element of $T$. Also, the distance between
$(g_1, \ldots, g_n)$ and $(g_1^*, \ldots, g_n^*)$ in the $d_{L^1(P)}$ metric is

$$d_{L^1(P)}(g, g^*) \le \frac{1}{n} \sum_{i=1}^{n} \epsilon = \epsilon .$$

Thus we obtain that

$$\mathcal{N}(\epsilon, H_I, d_{L^1(P)}) \le \left[\mathcal{N}(\epsilon, \mathcal{G}, d_{L^1(P)})\right]^n$$

and taking the supremum over all probability distributions as usual, we get

$$\mathcal{C}(\epsilon, H_I, d_{L^1}) \le \left(\mathcal{C}(\epsilon, \mathcal{G}, d_{L^1})\right)^n .$$

Now we need to find the capacity of $\mathcal{G}$. This is done in Claim 2-D.6. From this
the result follows. □

Definitions/Notations

Before we proceed to the next step in our proof, some more notation needs to be
defined. Let $\mathcal{A}$ be a family of functions from a set $S$ into $R$. For any sequence
$\xi = (\xi_1, \ldots, \xi_d)$ of points in $S$, let $\mathcal{A}_\xi$ be the restriction of $\mathcal{A}$ to the data set, as per
our previously introduced notation. Thus $\mathcal{A}_\xi = \{(a(\xi_1), \ldots, a(\xi_d)) : a \in \mathcal{A}\}$. If there
exists some translation of the set $\mathcal{A}_\xi$ such that it intersects all $2^d$ orthants of the
space $R^d$, then $\xi$ is said to be shattered by $\mathcal{A}$. Expressing this a little more formally,
let $B$ be the set of all possible $d$-dimensional boolean vectors. If there exists a translation
$t \in R^d$ such that for every $b \in B$, there exists some function $a_b \in \mathcal{A}$ satisfying
$a_b(\xi_i) - t_i \ge 0 \iff b_i = 1$ for all $i = 1$ to $d$, then the set $(\xi_1, \ldots, \xi_d)$ is shattered by $\mathcal{A}$.
Note that the inequality could easily have been defined to be strict and would not
have made a difference. The largest $d$ such that there exists a sequence of $d$ points
which are shattered by $\mathcal{A}$ is said to be the pseudo-dimension of $\mathcal{A}$, denoted by pdim $\mathcal{A}$.
□ 
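The shattering condition can be tested by brute force when the restricted class is a finite list of value vectors and a candidate translation is given. The sketch below is illustrative: `is_shattered` is a hypothetical helper, and the four affine functions a(x) = w x + c evaluated at the points 0 and 1 are an invented example.

```python
from itertools import product


def is_shattered(value_vectors, t):
    """Check the shattering condition for one candidate translation t.

    value_vectors holds the points (a(xi_1), ..., a(xi_d)) of the restricted class.
    The sequence is shattered via t if every boolean vector b is realized by some
    a with a(xi_i) - t_i >= 0 exactly when b_i = 1.
    """
    d = len(t)
    patterns = {
        tuple(1 if v[i] - t[i] >= 0 else 0 for i in range(d))
        for v in value_vectors
    }
    return all(b in patterns for b in product((0, 1), repeat=d))


# Hypothetical example: four affine functions a(x) = w * x + c evaluated at x = 0, 1.
params = [(0.0, 1.0), (0.0, -1.0), (-2.0, 1.0), (2.0, -1.0)]  # (w, c) pairs
points = (0.0, 1.0)
values = [tuple(w * x + c for x in points) for w, c in params]

print(is_shattered(values, (0.0, 0.0)))       # all four sign patterns occur
print(is_shattered(values[:3], (0.0, 0.0)))   # one pattern is now missing
```

The affine functions on the real line form a two-dimensional vector space, so shattering two points is in line with Dudley's theorem stated next; the check only verifies one given translation and does not search over all of them.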

In this context, there are two important theorems which we will need to use. We give
these theorems without proof.

Theorem 2-D.1 (Dudley) Let $F$ be a $k$-dimensional vector space of functions from
a set $S$ into $R$. Then pdim$(F) = k$.

The following theorem is stated and proved in a somewhat more general form by
Pollard. Haussler, using techniques from Pollard, has proved the specific form shown
here.

Theorem 2-D.2 (Pollard, Haussler) Let $F$ be a family of functions from a set
$S$ into $[M_1, M_2]$, where pdim$(F) = d$ for some $1 \le d < \infty$. Let $P$ be a probability
distribution on $S$. Then for all $0 < \epsilon \le M_2 - M_1$,

$$\mathcal{M}(\epsilon, F, d_{L^1(P)}) < 2 \left( \frac{2e(M_2 - M_1)}{\epsilon} \log \frac{2e(M_2 - M_1)}{\epsilon} \right)^d$$

Here $\mathcal{M}(\epsilon, F, d_{L^1(P)})$ is the packing number of $F$ according to the distance metric
$d_{L^1(P)}$.



Claim 2-D.6

$$\mathcal{C}(\epsilon, \mathcal{G}, d_{L^1}) \le 2 \left( \frac{2eV}{\epsilon} \ln \frac{2eV}{\epsilon} \right)^{k+2}$$

Proof: Consider the $(k + 2)$-dimensional vector space of functions from $R^k$ to $R$ defined
as

$$G_1 = \text{span}\{1, x^1, x^2, \ldots, x^k, \|x\|^2\}$$

where $x \in R^k$ and $x^\mu$ is the $\mu$-th component of the vector $x$. Now consider the class

$$G_2 = \{V e^{-f} : f \in G_1\} .$$

We claim that the pseudo-dimension of $\mathcal{G}$, denoted by pdim$(\mathcal{G})$, fulfills the following
inequality:

$$\text{pdim}(\mathcal{G}) \le \text{pdim}(G_2) = \text{pdim}(G_1) = k + 2 .$$

To see this, consider the fact that $\mathcal{G} \subseteq G_2$. Consequently, for every sequence of points
$x = (x_1, \ldots, x_d)$, $\mathcal{G}_x \subseteq (G_2)_x$. Thus if $(x_1, \ldots, x_d)$ is shattered by $\mathcal{G}$, it will be
shattered by $G_2$. This establishes the first inequality.

We now show that pdim$(G_2) \le$ pdim$(G_1)$. It is enough to show that every set shattered
by $G_2$ is also shattered by $G_1$. Suppose there exists a sequence $(x_1, x_2, \ldots, x_d)$
which is shattered by $G_2$. This means that, by our definition of shattering, there
exists a translation $t \in R^d$ such that for every boolean vector $b \in \{0, 1\}^d$ there
is some function $g_b = V e^{-f_b}$, where $f_b \in G_1$, satisfying $g_b(x_i) \ge t_i$ if and only if
$b_i = 1$, where $t_i$ and $b_i$ are the $i$-th components of $t$ and $b$ respectively. First notice
that every function in $G_2$ is positive. Consequently, we see that every $t_i$ has to be
greater than 0, for otherwise $g_b(x_i)$ could never be less than $t_i$, which it is required
to be if $b_i = 0$. Having established that every $t_i$ is greater than 0, we now show
that the set $(x_1, x_2, \ldots, x_d)$ is shattered by $G_1$. We let the translation in this case be
$t' = (\log(t_1/V), \log(t_2/V), \ldots, \log(t_d/V))$. We can take the log since the $t_i/V$ are
greater than 0. Now for every boolean vector $b$, we take the function $-f_b \in G_1$ and
we see that since

$$g_b(x_i) = V e^{-f_b(x_i)} \ge t_i \iff b_i = 1 ,$$

it follows that

$$-f_b(x_i) \ge \log(t_i/V) = t'_i \iff b_i = 1 .$$

Thus we see that the set $(x_1, x_2, \ldots, x_d)$ can be shattered by $G_1$. By a similar argument,
it is also possible to show that pdim$(G_1) \le$ pdim$(G_2)$.
Since $G_1$ is a vector space of dimensionality $k + 2$, an application of Dudley's Theorem
[37] yields the value $k + 2$ for its pseudo-dimension. Further, functions in the class $\mathcal{G}$
are in the range $[0, V]$. Now we see (by an application of Pollard's theorem) that

$$\mathcal{N}(\epsilon, \mathcal{G}, d_{L^1(P)}) \le 2 \left( \frac{2eV}{\epsilon} \ln \frac{2eV}{\epsilon} \right)^{\text{pdim}(\mathcal{G})} \le 2 \left( \frac{2eV}{\epsilon} \ln \frac{2eV}{\epsilon} \right)^{k+2} .$$

Taking the supremum over all probability distributions, the result follows. □ 

Claim 2-D.7

$$\mathcal{C}(\epsilon, H_F, d_{L^1}) \le 2 \left( \frac{4MeV}{\epsilon} \ln \frac{4MeV}{\epsilon} \right)^n$$

Proof: The proof of this runs in very similar fashion. First note that

$$H_F \subseteq \{\beta \cdot x : x, \beta \in R^n\} .$$

The latter set is a vector space of dimensionality $n$ and by Dudley's theorem [37], we
see that its pseudo-dimension is $n$. Also, clearly by the same argument as in the
previous proposition, we have that pdim$(H_F) \le n$. To get bounds on the functions
in $H_F$, notice that

$$\left| \sum_{i=1}^{n} \beta_i x_i \right| \le \sum_{i=1}^{n} |\beta_i| |x_i| \le V \sum_{i=1}^{n} |\beta_i| \le MV .$$

Thus functions in $H_F$ are bounded in the range $[-MV, MV]$. Now using Pollard's
result [62], [110], we have that

$$\mathcal{N}(\epsilon, H_F, d_{L^1(P)}) \le \mathcal{M}(\epsilon, H_F, d_{L^1(P)}) \le 2 \left( \frac{4MeV}{\epsilon} \ln \frac{4MeV}{\epsilon} \right)^n .$$

Taking supremums over all probability distributions, the result follows. □ 
Claim 2-D.8 A uniform first-order Lipschitz bound of $H_F$ is $Mn$.
Proof: Suppose we have $x, y \in R^n$ such that

$$d_{L^1}(x, y) \le \epsilon .$$

The quantity $Mn$ is a uniform first-order Lipschitz bound for $H_F$ if, for any element
of $H_F$, parametrized by a vector $\beta$, the following inequality holds:

$$|x \cdot \beta - y \cdot \beta| \le Mn\epsilon .$$

Now clearly, since $\sum_i |x_i - y_i| = n \, d_{L^1}(x, y) \le n\epsilon$,

$$|x \cdot \beta - y \cdot \beta| = \left| \sum_{i=1}^{n} \beta_i (x_i - y_i) \right| \le \sum_{i=1}^{n} |\beta_i| \, |x_i - y_i| \le M \sum_{i=1}^{n} |x_i - y_i| \le Mn\epsilon .$$

The result is proved. □ 
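The Lipschitz bound can be sanity-checked numerically on random inputs. The following sketch is not a proof and rests on assumptions: the sampling ranges, the seed, the tolerance, and the helper name `lipschitz_bound_holds` are all illustrative, and the normalized L1 distance is taken to be (1/n) times the sum of absolute coordinate differences, as in the text.

```python
import random


def d_l1(x, y):
    """Normalized L1 distance (1/n) * sum_i |x_i - y_i|."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


def lipschitz_bound_holds(n, M, trials=1000, seed=0):
    """Check |beta.x - beta.y| <= M * n * d_l1(x, y) on random draws.

    Each beta is rescaled so that sum_i |beta_i| <= M, as required for H_F.
    """
    rng = random.Random(seed)
    for _ in range(trials):
        raw = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        norm = sum(abs(b) for b in raw) or 1.0
        scale = M * rng.random() / norm
        beta = [b * scale for b in raw]
        x = [rng.uniform(0.0, 1.0) for _ in range(n)]
        y = [rng.uniform(0.0, 1.0) for _ in range(n)]
        lhs = abs(sum(b * (xi - yi) for b, xi, yi in zip(beta, x, y)))
        if lhs > M * n * d_l1(x, y) + 1e-12:  # small slack for rounding
            return False
    return True


print(lipschitz_bound_holds(n=5, M=2.0))
```

The factor n appears only because the metric is normalized by 1/n; with an unnormalized L1 distance the same argument gives the constant M.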



Claim 2-D.9

$$\mathcal{C}(\epsilon, H_n, d_{L^1}) \le \mathcal{C}\left(\frac{\epsilon}{2Mn}, H_I, d_{L^1}\right) \mathcal{C}\left(\frac{\epsilon}{2}, H_F, d_{L^1}\right)$$

Proof: Fix a distribution $P$ on $R^k$. Assume we have an $\epsilon/(2Mn)$-cover for $H_I$ with
respect to the probability distribution $P$ and metric $d_{L^1(P)}$. Let it be $K$ where

$$|K| = \mathcal{N}(\epsilon/2Mn, H_I, d_{L^1(P)}) .$$

Now each function $f \in K$ maps the space $R^k$ into $R^n$, thus inducing a probability
distribution $P_f$ on the space $R^n$. Specifically, $P_f$ can be defined as the distribution
obtained from the measure $\mu_f$ defined so that any measurable set $A \subseteq R^n$ will have
measure

$$\mu_f(A) = \int_{f^{-1}(A)} P(x) \, dx .$$

Further, there exists a cover $K_f$ which is an $\epsilon/2$-cover for $H_F$ with respect to the
probability distribution $P_f$. In other words

$$|K_f| = \mathcal{N}(\epsilon/2, H_F, d_{L^1(P_f)}) .$$

We claim that

$$H(K) = \{f \circ g : g \in K \text{ and } f \in K_g\}$$

is an $\epsilon$-cover for $H_n$. Further we note that

$$|H(K)| = \sum_{f \in K} |K_f| \le \sum_{f \in K} \mathcal{C}(\epsilon/2, H_F, d_{L^1}) \le \mathcal{N}(\epsilon/(2Mn), H_I, d_{L^1(P)}) \, \mathcal{C}(\epsilon/2, H_F, d_{L^1}) .$$

To see that $H(K)$ is an $\epsilon$-cover, suppose we are given an arbitrary function $h_f \circ h_i \in
H_n$. There clearly exists a function $h_i^* \in K$ such that

$$\int_{R^k} d_{L^1}(h_i(x), h_i^*(x)) \, P(x) \, dx \le \epsilon/(2Mn) .$$

Now there also exists a function $h_f^* \in K_{h_i^*}$ such that

$$\int_{R^k} |h_f \circ h_i^*(x) - h_f^* \circ h_i^*(x)| \, P(x) \, dx = \int_{R^n} |h_f(y) - h_f^*(y)| \, P_{h_i^*}(y) \, dy \le \epsilon/2 .$$

To show that $H(K)$ is an $\epsilon$-cover it is sufficient to show that

$$\int_{R^k} |h_f \circ h_i(x) - h_f^* \circ h_i^*(x)| \, P(x) \, dx \le \epsilon .$$

Now

$$\int_{R^k} |h_f \circ h_i(x) - h_f^* \circ h_i^*(x)| \, P(x) \, dx \le \int_{R^k} \left\{ |h_f \circ h_i(x) - h_f \circ h_i^*(x)| + |h_f \circ h_i^*(x) - h_f^* \circ h_i^*(x)| \right\} P(x) \, dx$$

by the triangle inequality. Further, since $h_f$ is Lipschitz bounded,

$$\int_{R^k} |h_f \circ h_i(x) - h_f \circ h_i^*(x)| \, P(x) \, dx \le \int_{R^k} Mn \, d_{L^1}(h_i(x), h_i^*(x)) \, P(x) \, dx \le Mn \, \frac{\epsilon}{2Mn} = \epsilon/2 .$$

Also,

$$\int_{R^k} |h_f \circ h_i^*(x) - h_f^* \circ h_i^*(x)| \, P(x) \, dx = \int_{R^n} |h_f(y) - h_f^*(y)| \, P_{h_i^*}(y) \, dy \le \epsilon/2 .$$

Consequently both integrals are at most $\epsilon/2$ and the total integral is at most $\epsilon$. Now
we see that

$$\mathcal{N}(\epsilon, H_n, d_{L^1(P)}) \le \mathcal{N}(\epsilon/(2Mn), H_I, d_{L^1(P)}) \, \mathcal{C}(\epsilon/2, H_F, d_{L^1}) .$$

Taking supremums over all probability distributions, the result follows. □ 



Having obtained the crucial bound on the metric capacity of the class $H_n$, we can
now prove the following

Claim 2-D.10 With probability $1 - \delta$, and $\forall h \in H_n$, the following bound holds:

$$|I_{emp}[h] - I[h]| \le O\left( \left[ \frac{nk \ln(nl) + \ln(1/\delta)}{l} \right]^{1/2} \right)$$


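Proof: We know from the previous claim that

$$\mathcal{C}(\epsilon, H_n, d_{L^1}) \le 2^{n+1} \left[ \frac{4MeVn}{\epsilon} \ln \frac{4MeVn}{\epsilon} \right]^{n(k+2)} \left[ \frac{8MeV}{\epsilon} \ln \frac{8MeV}{\epsilon} \right]^{n} \le \left[ \frac{8MeVn}{\epsilon} \ln \frac{8MeVn}{\epsilon} \right]^{n(k+3)} .$$

From claim (2-D.3), we see that

$$P\left(\forall h \in H_n, \; |I_{emp}[h] - I[h]| \le \epsilon\right) \ge 1 - \delta \qquad (2.26)$$

as long as

$$\mathcal{C}(\epsilon/16, \mathcal{A}, d_{L^1}) \, e^{-\frac{1}{128 U^4} \epsilon^2 l} \le \frac{\delta}{4} ,$$

which in turn is satisfied as long as (by Claim 2-D.4)

$$\mathcal{C}(\epsilon/64U, H_n, d_{L^1}) \, e^{-\frac{1}{128 U^4} \epsilon^2 l} \le \frac{\delta}{4} ,$$

which in turn holds as long as

$$\left[ \frac{512 MeVUn}{\epsilon} \ln \frac{512 MeVUn}{\epsilon} \right]^{n(k+3)} e^{-\frac{1}{128 U^4} \epsilon^2 l} \le \frac{\delta}{4} .$$

In other words,

$$\left( \frac{An}{\epsilon} \ln \frac{An}{\epsilon} \right)^{n(k+3)} e^{-\epsilon^2 l / B} \le \frac{\delta}{4}$$

for constants $A$, $B$. Since $\ln z \le z$, the latter inequality is satisfied as long as

$$\left( \frac{An}{\epsilon} \right)^{2n(k+3)} e^{-\epsilon^2 l / B} \le \frac{\delta}{4} ,$$

which, taking logarithms, reads

$$2n(k + 3)(\ln(An) - \ln(\epsilon)) - \epsilon^2 l / B \le \ln(\delta/4)$$

and is equivalent to

$$\epsilon^2 l \ge B \ln(4/\delta) + 2Bn(k + 3)(\ln(An) - \ln(\epsilon)) .$$

We now show that the above inequality is satisfied for

$$\epsilon = \left( \frac{B \left[ \ln(4/\delta) + 2n(k + 3) \ln(An) + n(k + 3) \ln(l) \right]}{l} \right)^{1/2} .$$

Putting the above value of $\epsilon$ in the inequality of interest, we need

$$\epsilon^2 (l/B) = \ln(4/\delta) + 2n(k + 3) \ln(An) + n(k + 3) \ln(l) \ge$$

$$\ge \ln(4/\delta) + 2n(k + 3) \ln(An) + 2n(k + 3) \, \frac{1}{2} \ln\left( \frac{l}{B \left[ \ln(4/\delta) + 2n(k + 3) \ln(An) + n(k + 3) \ln(l) \right]} \right) .$$

In other words,

$$n(k + 3) \ln(l) \ge n(k + 3) \ln\left( \frac{l}{B \left[ \ln(4/\delta) + 2n(k + 3) \ln(An) + n(k + 3) \ln(l) \right]} \right) .$$

Since

$$B \left[ \ln(4/\delta) + 2n(k + 3) \ln(An) + n(k + 3) \ln(l) \right] \ge 1 ,$$

the inequality is obviously true for this value of $\epsilon$. Taking this value of $\epsilon$ then proves
our claim. □ 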
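The choice of ε made at the end of the proof can be checked numerically against the sufficient condition it is meant to satisfy. This is a sketch under explicit assumptions: the constants A and B are absorbed problem-dependent quantities whose values the text does not give, so the defaults A = B = 1 below are placeholders, and the function names are hypothetical.

```python
import math


def epsilon_for(n, k, l, delta, A=1.0, B=1.0):
    """epsilon = sqrt(B [ln(4/delta) + 2n(k+3) ln(An) + n(k+3) ln(l)] / l)."""
    inner = (math.log(4.0 / delta)
             + 2 * n * (k + 3) * math.log(A * n)
             + n * (k + 3) * math.log(l))
    return math.sqrt(B * inner / l)


def satisfies_condition(eps, n, k, l, delta, A=1.0, B=1.0):
    """Check eps^2 l >= B ln(4/delta) + 2 B n (k+3) (ln(An) - ln(eps))."""
    rhs = (B * math.log(4.0 / delta)
           + 2 * B * n * (k + 3) * (math.log(A * n) - math.log(eps)))
    return eps ** 2 * l >= rhs


# Hypothetical settings: n basis functions, input dimension k, l examples.
for n, k, l, delta in [(10, 2, 10_000, 0.05), (3, 5, 500, 0.01)]:
    eps = epsilon_for(n, k, l, delta)
    print(n, k, l, round(eps, 4), satisfies_condition(eps, n, k, l, delta))
```

With these placeholder constants the bracketed quantity is at least ln 4, so the final step of the proof applies; for other values of A and B that condition would need to be rechecked.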
2-D.3 Bounding the generalization error 

Finally we are able to take our results in Parts II and III to prove our main result:
Theorem 2-D.3 With probability greater than $1 - \delta$ the following inequality is valid:

$$\|f_0 - f_{n,l}\|^2_{L^2(P)} \le O\left(\frac{1}{n}\right) + O\left( \left[ \frac{nk \ln(nl) - \ln \delta}{l} \right]^{1/2} \right)$$

Proof: We have seen in statement (2.2.1) that the generalization error is bounded
as follows:

$$\|f_0 - f_{n,l}\|^2_{L^2(P)} \le \epsilon(n) + 2\omega(l, n, \delta) .$$

In section (2-D.1) we showed that

$$\epsilon(n) = O\left(\frac{1}{n}\right)$$

and in claim (2-D.10) we showed that

$$\omega(l, n, \delta) = O\left( \left[ \frac{nk \ln(nl) - \ln \delta}{l} \right]^{1/2} \right) .$$

Therefore the theorem is proved by putting these results together. □ 
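The two terms of the bound pull in opposite directions as the number of basis functions n grows, and a small script makes the trade-off visible. The sketch below is only indicative: the O(.) constants are unknown, so both are set to 1 through the hypothetical parameters `c1` and `c2`, and the function names and the search range are assumptions.

```python
import math


def bound(n, l, k, delta, c1=1.0, c2=1.0):
    """Shape of the bound: c1/n + c2 * sqrt((n k ln(n l) - ln(delta)) / l)."""
    approximation = c1 / n
    estimation = c2 * math.sqrt((n * k * math.log(n * l) - math.log(delta)) / l)
    return approximation + estimation


def best_n(l, k, delta, n_max=1000):
    """Network size in 1..n_max that minimizes the bound for l examples."""
    return min(range(1, n_max + 1), key=lambda n: bound(n, l, k, delta))


# With more examples, a larger network becomes worthwhile.
for l in (10**3, 10**4, 10**5, 10**6):
    n_star = best_n(l, k=2, delta=0.05)
    print(l, n_star, round(bound(n_star, l, 2, 0.05), 4))
```

The first term falls with n while the second rises, so for a fixed number of examples there is an intermediate network size that balances the two, and that size grows as more data become available.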
Chapter 3 



Investigating the Sample Complexity of 
Active Learning Schemes 

Abstract 

In the classical learning framework of the previous chapter (akin to PAC) examples were randomly
drawn and presented to the learner. In this chapter, we consider the possibility of a more active
learner who is allowed to choose his/her own examples. Our investigations can be divided into two
natural parts. The first is in a function approximation setting, and develops an adaptive sampling
strategy (equivalent to adaptive approximation) motivated from the standpoint of optimal recovery
(Micchelli and Rivlin, 1976). We provide a general formulation of the problem. This can be regarded
as sequential optimal recovery. We demonstrate the application of this general formulation to two
special cases of functions on the real line: 1) monotonically increasing functions and 2) functions
with bounded derivative. An extensive investigation of the sample complexity of approximating
these functions is conducted, yielding both theoretical and empirical results on test functions. Our
theoretical results (stated in PAC-style), along with the simulations, demonstrate the superiority of
our active scheme over both passive learning and classical optimal recovery. The second part
of this chapter is in a concept learning framework and discusses the idea of ε-focusing: a scheme
where the active learner can iteratively draw examples from smaller and smaller regions of the input
space, thereby gaining vast improvements in sample complexity. 

In Chapter 2, we considered a learning paradigm where the learner's hypothesis
was constrained to belong to a class of functions which can be represented by a
sum of radial basis functions. It was assumed that the examples ((x, y) pairs) were
drawn according to some fixed, unknown, arbitrary probability distribution. In this
important sense, the learner was merely a passive recipient of information about
the target function. In this chapter, we consider the possibility of a more active
learner. There are of course myriad ways in which a learner could be more active.
Consider, for example, the extreme pathological case where the learner simply asks for
the true target function, which is duly provided by an obliging oracle. This, the reader
will quickly realize, is hardly interesting. Such pathological cases aside, this theme of
activity on the part of the learner has been explored in more principled ways (though it
is not always conceived as such) in a number of different settings (PAC-style concept
learning, boundary-hunting pattern recognition schemes, adaptive integration, optimal
sampling, etc.), and we will comment on these in due course. 

For our purposes, we restrict our attention in this chapter to the situation where
the learner is allowed to choose its own examples (see footnote 13), in other words, decide where
in the domain D (for functions defined from D to Y) it would like to sample the
target function. Note that this is in direct contrast to the passive case where the
learner is presented with randomly drawn examples. Keeping other factors in the
learning paradigm unchanged, we then compare in this chapter the active and passive
learners, who differ only in their method of collecting examples. At the outset, we are
particularly interested in whether there exist principled ways of collecting examples
in the first place. A second important consideration is whether these ways allow the
learner to learn with fewer examples. This latter question is particularly
in keeping with the spirit of this thesis, viz., the informational complexity of learning
from examples. 

This chapter can be divided into two parts which are roughly self-contained. In
Part I, we consider active learning in an approximation-theoretic setting. We develop
a general framework for collecting examples for approximating (learning) real-valued
functions. We then demonstrate the application of this framework to some specific classes
of functions. We obtain theoretical bounds on the sample complexity of the active
and passive learners, and perform some empirical simulations to demonstrate the
superiority of the active learner. Part II discusses the idea of ε-focusing, a paradigm
in which the learner iteratively focuses in on specific "interesting" regions of the input
space to collect its examples. This is largely in a concept learning (alternatively,
pattern classification) setting. We are able to show how, using this idea, one can get
large gains in sample complexity for some concept classes. 

Part I: Active Learning for Approximation of Real 
Valued Functions 



13 This can be regarded as a computational instantiation of the psychological practice of selective 
attention where a human might choose to selectively concentrate on interesting or confusing regions 
of the feature space in order to better grasp the underlying concept. Consider, for example, the 
situation when one encounters a speaker with a foreign accent. One cues in to this foreign speech by 
focusing on and then adapting to its distinguishing properties. This is often accomplished by asking 
the speaker to repeat words which are confusing to us. 
3.1 A General Framework For Active Approximation 

3.1.1 Preliminaries 

We need to develop the following notions: 

$\mathcal{F}$: Let $\mathcal{F}$ denote a class of functions from some domain $D$ to $Y$ where $Y$ is a subset
of the real line. The domain $D$ is typically a subset of $R^k$ though it could be more
general than that. There is some unknown target function $f \in \mathcal{F}$ which has to be
approximated by an approximation scheme.

$\mathcal{D}$: This is a data set obtained by sampling the target $f \in \mathcal{F}$ at a number of points
in its domain. Thus,

$$\mathcal{D} = \{(x_i, y_i) \mid x_i \in D, \; y_i = f(x_i), \; i = 1 \ldots n\}$$

Notice that the data is uncorrupted by noise.

$\mathcal{H}$: This is a class of functions (also from $D$ to $Y$) from which the learner will choose
one in an attempt to approximate the target $f$. Notationally, we will use $\mathcal{H}$ to refer
not merely to the class of functions (hypothesis class) but also the algorithm by means
of which the learner picks an approximating function $h \in \mathcal{H}$ on the basis of the data
set $\mathcal{D}$. In other words, $\mathcal{H}$ denotes an approximation scheme which is really a tuple
$\langle \mathcal{H}, \mathcal{A} \rangle$. $\mathcal{A}$ is an algorithm that takes as its input the data set $\mathcal{D}$, and outputs an
$h \in \mathcal{H}$.

Examples: If we consider real-valued functions from $R^k$ to $R$, some typical examples
of $\mathcal{H}$ are the class of polynomials of a fixed order (say $q$), splines of some fixed order,
radial basis functions with some bound on the number of nodes, etc. As a concrete
example, consider functions from $[0, 1]$ to $R$. Imagine a data set is collected which
consists of examples, i.e., $(x_i, y_i)$ pairs as per our notation. Without loss of generality,
one could assume that $x_i \le x_{i+1}$ for each $i$. Then a cubic (degree-3) spline is obtained
by interpolating the data points by polynomial pieces (with the pieces tied together
at the data points or "knots") such that the overall function is twice-differentiable at
the knots. Fig. 3-13 shows an example of an arbitrary data set fitted by cubic splines. 



Figure 3-13: An arbitrary data set fitted with cubic splines

$d_C$: We need a metric to determine how good the learner's approximation is.
Specifically, the metric $d_C$ measures the approximation error on the region
$C$ of the domain $D$. In other words, $d_C$ takes as its input any two functions (say $f_1$
and $f_2$) from $D$ to $R$ and outputs a real number. It is assumed that $d_C$ satisfies all
the requisites for being a real distance metric on the appropriate space of functions.
Since the approximation error on a larger domain is obviously going to be greater
than that on the smaller domain, we can make the following two observations: 1) for
any two sets $C_1$ and $C_2$ such that $C_1 \subseteq C_2$, $d_{C_1}(f_1, f_2) \le d_{C_2}(f_1, f_2)$; 2) $d_D(f_1, f_2)$ is
the total approximation error on the entire domain; this is our basic criterion for judging
the "goodness" of the learner's hypothesis.

Examples: For real-valued functions from $R^k$ to $R$, the $L^p_C$ metric defined as
$d_C(f_1, f_2) = \left( \int_C |f_1 - f_2|^p \, dx \right)^{1/p}$ serves as a natural example of an error metric. 
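For functions on an interval of the real line, the L^p error metric over a region C = [a, b] can be approximated by simple quadrature. The following is a rough sketch: the midpoint rule, the number of steps, the helper name `lp_distance`, and the two test functions are all illustrative assumptions rather than anything prescribed by the text.

```python
def lp_distance(f1, f2, a, b, p=2, steps=10_000):
    """Midpoint-rule approximation of (integral over [a, b] of |f1 - f2|^p dx)^(1/p)."""
    width = (b - a) / steps
    total = 0.0
    for i in range(steps):
        x = a + (i + 0.5) * width
        total += abs(f1(x) - f2(x)) ** p
    return (total * width) ** (1.0 / p)


# Hypothetical target and hypothesis on [0, 1].
target = lambda x: x
hypothesis = lambda x: 0.0

# The error over a sub-interval never exceeds the error over the whole domain.
print(lp_distance(target, hypothesis, 0.0, 0.5))
print(lp_distance(target, hypothesis, 0.0, 1.0))
```

The two printed values illustrate the monotonicity property noted above: enlarging the region from [0, 0.5] to [0, 1] can only increase the measured error.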

$\mathcal{C}$: This is a collection of subsets $C$ of the domain. We are assuming that points in the
domain where the function is sampled divide (partition) the domain into a collection
of disjoint sets $C_i \in \mathcal{C}$ such that $\cup_{i=1}^{n} C_i = D$.

Examples: For the case of functions from $[0, 1]$ to $R$, and a data set $\mathcal{D}$, a natural
way in which to partition the domain $[0, 1]$ is into the intervals $[x_i, x_{i+1})$ (here again,
without loss of generality we have assumed that $x_i \le x_{i+1}$). The set $\mathcal{C}$ could be the
set of all (closed, open, or half-open and half-closed) intervals $[a, b] \subseteq [0, 1]$. 
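The interval partition induced by a set of sample points is easy to compute. The sketch below assumes the domain is an interval [lo, hi] and that the domain endpoints are added as breakpoints; the function name `interval_partition` and the sample values are hypothetical.

```python
def interval_partition(xs, lo=0.0, hi=1.0):
    """Split [lo, hi] at the sampled points and return consecutive (left, right) pairs.

    Each pair stands for the interval [left, right); duplicates are merged.
    """
    if any(x < lo or x > hi for x in xs):
        raise ValueError("sample points must lie inside the domain")
    breakpoints = sorted(set(xs) | {lo, hi})
    return list(zip(breakpoints[:-1], breakpoints[1:]))


# Two interior sample points split [0, 1] into three regions.
print(interval_partition([0.7, 0.2]))
```

With n distinct sample points strictly inside the domain this yields n + 1 regions, so the partition refines as more examples are collected.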

The goal of the learner (operating with an approximation scheme $\mathcal{H}$) is to provide
a hypothesis $h \in \mathcal{H}$ (which it chooses on the basis of its example set $\mathcal{D}$) as an
approximator of the unknown target function $f \in \mathcal{F}$. We now need to formally lay
down a criterion for assessing the competence of a learner (approximation scheme).
In recent times, there has been much use of PAC-like (Valiant 1984) criteria to assess
learning algorithms. Such a criterion has been used largely for concept learning but
some extensions to the case of real-valued functions exist (Haussler 1989). We adapt
here for our purposes a PAC-like criterion to judge the efficacy of approximation
schemes of the kind described earlier. 

Definition 3.1.1 An approximation scheme is said to P-PAC learn the function $f \in
\mathcal{F}$ if for every $\epsilon > 0$ and $1 > \delta > 0$, and for an arbitrary distribution $P$ on $D$,
it collects a data set $\mathcal{D}$, and computes a hypothesis $h \in \mathcal{H}$ such that $d_D(h, f) < \epsilon$
with probability greater than $1 - \delta$. The function class $\mathcal{F}$ is P-PAC learnable if the
approximation scheme can P-PAC learn every function in $\mathcal{F}$. The class $\mathcal{F}$ is PAC
learnable if the approximation scheme can P-PAC learn the class for every distribution
$P$. 

There is an important clarification to be made about our definition above. Note
that the distance metric $d$ is arbitrary. It need not be naturally related to the
distribution $P$ according to which the data is drawn. Recall that this is not so in typical
distance metrics used in classical PAC formulations. For example, in concept learning,
where the set $\mathcal{F}$ consists of indicator functions, the metric used is the $L^1(P)$ metric
given by $d(1_A, 1_B) = \int_D |1_A - 1_B| \, P(x) \, dx$. Similarly, extensions to real-valued
functions typically use an $L^2(P)$ metric. The use of such metrics implies that the training
error is an empirical average of the true underlying error. One can then make use of
convergence of empirical means to true means (Vapnik, 1982) and prove learnability.
In our case, this is not necessarily so. For example, one could always come up
with a distribution $P$ which would never allow a passive learner to see examples in
a certain region of the domain. However, the arbitrary metric $d$ might weigh this
region heavily. Thus the learner would never be able to learn such a function class for
this metric. In this sense, our model is more demanding than classical PAC. To make
matters easy, we will consider here the case of P-PAC learnability alone, where
$P$ is a known distribution (uniform in the example cases studied). However, there is
a sense in which our notion of PAC is easier: the learner knows the true metric $d$
and, given any two functions, can compute their relative distance. This is not so in
classical PAC, where the learner cannot compute the distance between two functions
since it does not know the underlying distribution. 

We have left the mechanism of data collection undefined. Our goal here is the
investigation of different methods of data collection. A baseline against which we will
compare all such schemes is the passive method of data collection where the learner
collects its data set by sampling $D$ according to $P$ and receiving the point $(x, f(x))$. If
the learner were allowed to draw its own examples, are there principled ways in which
it could do this? Further, as a consequence of this flexibility accorded to the learner
in its data gathering scheme, could it learn the class $\mathcal{F}$ with fewer examples? These
are the questions we attempt to resolve in this chapter, and we begin by motivating
and deriving in the next section a general framework for active selection of data for
arbitrary approximation schemes. 

3.1.2 The Problem of Collecting Examples 

We have introduced in the earlier section our baseline algorithm for collecting
examples. This corresponds to a passive learner that draws examples according to the
probability distribution $P$ on the domain $D$. If such a passive learner collects
examples and produces an output $h$ such that $d_D(h, f)$ is less than $\epsilon$ with probability
greater than $1 - \delta$, it P-PAC learns the function. The number of examples that a
learner needs before it produces such an ($\epsilon$-good, $\delta$-confidence) hypothesis is called its
sample complexity. 

Against this baseline passive data collection scheme lies the possibility of allowing
the learner to choose its own examples. At the outset it might seem reasonable to
believe that a data set would provide the learner with some information about the
target function; in particular, it would probably inform it about the "interesting"
regions of the function, or regions where the approximation error is high and which need
further sampling. On the basis of this kind of information (along with other
information about the class of functions in general) one might be able to decide where to
sample next. We formalize this notion as follows: 

Let $\mathcal{D} = \{(x_i, y_i);\ i = 1, \ldots, n\}$ be a data set (containing n data points) which the 
learner has access to. The approximation scheme acts upon this data set and picks an 
$h \in \mathcal{H}$ (which best fits the data according to the specifics of the algorithm A inherent 
in the approximation scheme). Further, let $C_i$, $i = 1, \ldots, K(n)$ (see footnote 14) be a partition of the 
domain D into different regions on the basis of this data set. Finally let 

$$\mathcal{F}_{\mathcal{D}} = \{ f \in \mathcal{F} \mid f(x_i) = y_i \ \ \forall (x_i, y_i) \in \mathcal{D} \}$$



14 The number of regions K(n) into which the domain D is partitioned by n data points depends 
upon the geometry of D and the partition scheme used. For the real line partitioned into intervals 
as in our example, K(n) = n + 1. For k-cubes, one might obtain Voronoi partitions and compute 
K(n) accordingly. 

This is the set of all functions in $\mathcal{F}$ which are consistent with the data seen so far. 
The target function could be any one of the functions in $\mathcal{F}_{\mathcal{D}}$. 

We first define an error criterion $e_C$ (where C is any subset of the domain) as 
follows: 

$$e_C(\mathcal{H}, \mathcal{D}, \mathcal{F}) = \sup_{f \in \mathcal{F}_{\mathcal{D}}} d_C(h, f)$$

Essentially, $e_C$ is a measure of the maximum possible error the approximation 
scheme could have (over the region C) given the data it has seen so far. It clearly 
depends on the data, the approximation scheme, and the class of functions being 
learned. It does not depend upon the target function (except indirectly in the sense 
that the data is generated by the target function after all, and this dependence is 
already captured in the expression). We thus have a scheme to measure uncertainty 
(maximum possible error) over the different regions of the input space D. One possible 
strategy to select a new point might simply be to sample the function in the region 
$C_i$ where the error bound is the highest. Let us assume we have a procedure $\mathcal{P}$ to 
do this. $\mathcal{P}$ could be to sample the region C at the centroid of C, or to sample C 
according to some distribution on it, or any other method one might fancy. This can 
be described as follows: 

Active Algorithm A 

1. [Initialize] Collect one example $(x_1, y_1)$ by sampling the domain D once 
according to procedure $\mathcal{P}$. 

2. [Obtain New Partitions] Divide the domain D into regions $C_1, \ldots, C_{K(1)}$ on 
the basis of this data point. 

3. [Compute Uncertainties] Compute $e_{C_i}$ for each i. 

4. [General Update and Stopping Rule] In general, at the jth stage, suppose 
that our partition of the domain D is into $C_i$, $i = 1, \ldots, K(j)$. One can compute 
$e_{C_i}$ for each i and sample the region with maximum uncertainty (say $C_l$) 
according to procedure $\mathcal{P}$. This would provide a new data point $(x_{j+1}, y_{j+1})$. The new 
data point would re-partition the domain D into new regions. At any stage, if 
the maximum uncertainty over the entire domain $e_D$ is less than $\epsilon$, stop. 

The above algorithm is one possible active strategy. However, one can carry the 
argument a little further and obtain an optimal sampling strategy which would give us 
a precise location for the next sample point. Imagine for a moment that the learner 
asks for the value of the function at a point $x \in D$. The value returned obviously 
belongs to the set 

$$\mathcal{F}_{\mathcal{D}}(x) = \{ f(x) \mid f \in \mathcal{F}_{\mathcal{D}} \}$$

Assume that the value observed was $y \in \mathcal{F}_{\mathcal{D}}(x)$. In effect, the learner now has one 
more example, the pair (x, y), which it can add to its data set to obtain a new, larger 
data set $\mathcal{D}'$ where 

$$\mathcal{D}' = \mathcal{D} \cup (x, y)$$

Once again, the approximation scheme $\mathcal{H}$ would map the new data set $\mathcal{D}'$ into a 
new hypothesis h'. One can compute 

$$e_C(\mathcal{H}, \mathcal{D}', \mathcal{F}) = \sup_{f \in \mathcal{F}_{\mathcal{D}'}} d_C(h', f)$$

Clearly, $e_D(\mathcal{H}, \mathcal{D}', \mathcal{F})$ now measures the maximum possible error after seeing this 
new data point. This depends upon (x, y) (in addition to the usual $\mathcal{H}$, $\mathcal{D}$, and $\mathcal{F}$). For 
a fixed x, we don't know the value of y we would observe if we had chosen to sample 
at that point. Consequently, a natural thing to do at this stage is to again take a 
worst case bound, i.e., assume we would get the most unfavorable y and proceed. 
This would provide the maximum possible error we could make if we had chosen to 
sample at x. This error (over the entire domain) is 

$$\sup_{y \in \mathcal{F}_{\mathcal{D}}(x)} e_D(\mathcal{H}, \mathcal{D}', \mathcal{F}) = \sup_{y \in \mathcal{F}_{\mathcal{D}}(x)} e_D(\mathcal{H}, \mathcal{D} \cup (x, y), \mathcal{F})$$

Naturally, we would like to sample the point x for which this maximum error is 
minimized. Thus, the optimal point to sample by this argument is 

$$x_{new} = \arg\min_{x \in D} \sup_{y \in \mathcal{F}_{\mathcal{D}}(x)} e_D(\mathcal{H}, \mathcal{D} \cup (x, y), \mathcal{F}) \qquad (3.27)$$

This provides us with a principled strategy to choose our next point. The following 
optimal active learning algorithm follows: 

Active Algorithm B (Optimal) 

1. [Initialize] Collect one example $(x_1, y_1)$ by sampling the domain D once 
according to procedure $\mathcal{P}$. We do this because without any data, the approximation 
scheme would not be able to produce any hypothesis. 

2. [Compute Next Point to Sample] Apply eq. 3.27 and obtain $x_2$. Sampling 
the function at this point yields the next data point $(x_2, y_2)$ which is added to 
the data set. 

3. [General Update and Stopping Rule] In general, at the jth stage, assume 
we have in place a data set $\mathcal{D}_j$ (consisting of j data points). One can compute $x_{j+1}$ 
according to eq. 3.27, and by sampling the function there one can obtain a new 
hypothesis and a new data set $\mathcal{D}_{j+1}$. In general, as in Algorithm A, stop whenever 
the total error $e_D(\mathcal{H}, \mathcal{D}_k, \mathcal{F})$ is less than $\epsilon$. 
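
The selection rule of eq. 3.27 can be made concrete on a toy problem. The following is a minimal sketch, not the thesis's implementation: it assumes a finite domain, a finite function class given as tuples of values, an L1 distance, and a hypothetical approximation scheme `fit` that returns the middle element of the sorted consistent set. All function names are illustrative.

```python
def consistent(F, data):
    """Functions in F (tuples of values) that agree with every (x, y) in data."""
    return [f for f in F if all(f[x] == y for x, y in data.items())]


def dist(f, g):
    """L1 distance between two functions on the finite domain."""
    return sum(abs(a - b) for a, b in zip(f, g))


def fit(candidates):
    """Hypothetical approximation scheme: middle element of the sorted consistent set."""
    return sorted(candidates)[len(candidates) // 2]


def max_error(F, data):
    """Worst-case error of the hypothesis over all functions consistent with data."""
    cands = consistent(F, data)
    h = fit(cands)
    return max(dist(h, f) for f in cands)


def next_query(F, data, domain):
    """Minimax choice in the spirit of eq. 3.27: best x against the worst possible y."""
    cands = consistent(F, data)
    best_x, best_val = None, float("inf")
    for x in domain:
        possible_y = {f[x] for f in cands}  # values the adversary may return at x
        worst = max(max_error(F, {**data, x: y}) for y in possible_y)
        if worst < best_val:
            best_x, best_val = x, worst
    return best_x, best_val


# Toy class: unit step functions on {0, ..., 7} with threshold t in {0, ..., 8}.
steps = [tuple(1 if x >= t else 0 for x in range(8)) for t in range(9)]
x_new, worst_case = next_query(steps, {0: 0}, range(8))
```

On this toy class the rule picks a point near the middle of the still-uncertain region, which is the bisection behaviour one would expect; with a different `fit` or metric the chosen point may differ.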

By the process of derivation, it should be clear that if we chose to sample at some 
point other than that obtained by eq. 3.27, an adversary could provide a y value and 
a function consistent with all the data provided (including the new data point), that 
would force the learner to make a larger error than if the learner chose to sample at 
$x_{new}$. In this sense, algorithm B is optimal. It also differs from algorithm A, in that it 
does not require a partition scheme, or a procedure $\mathcal{P}$ to choose a point in some region. 
However, the computation of $x_{new}$ inherent in algorithm B is typically more intensive 
than computations required by algorithm A. Finally, it is worthwhile to observe that 
crucial to our formulation is the derivation of the error bound $e_D(\mathcal{H}, \mathcal{D}, \mathcal{F})$. As we 
have noted earlier, this is a measure of the maximum possible error the approximation 
scheme $\mathcal{H}$ could be forced to make in approximating functions of $\mathcal{F}$ using the data 
set $\mathcal{D}$. Now, if one wanted an approximation scheme independent bound, this would 
be obtained by minimizing $e_D$ over all possible schemes, i.e., 

$$\inf_{\mathcal{H}} e_D(\mathcal{H}, \mathcal{D}, \mathcal{F})$$

Any approximation scheme can be forced to make at least as much error as the above 
expression denotes. Another bound of some interest is obtained by removing the 
dependence of $e_D$ on the data. Thus given an approximation scheme $\mathcal{H}$, if data $\mathcal{D}$ is 
drawn randomly, one could compute 

$$P\{ e_D(\mathcal{H}, \mathcal{D}, \mathcal{F}) > \epsilon \}$$

or in an approximation scheme-independent setting, one computes 

$$P\{ \inf_{\mathcal{H}} e_D(\mathcal{H}, \mathcal{D}, \mathcal{F}) > \epsilon \}$$

The above expressions would provide us PAC-like bounds which we will make use of 
later in this chapter. 

3.1.3 In Context 

Having motivated and derived two possible active strategies, it is worthwhile at this 
stage to comment on the formulation and its place in the context of previous work in 
a similar vein executed across a number of disciplines. 

1) Optimal Recovery: The question of choosing the location of points where the 
unknown function will be sampled has been studied within the framework of opti- 
mal recovery (Micchelli and Rivlin, 1976; Micchelli and Wahba, 1981; Athavale and 
Wahba, 1979). While work of this nature has strong connections to our formulation, 
there remains a crucial difference. Sampling schemes motivated by optimal recovery 
are not adaptive. In other words, given a class of functions $\mathcal{F}$ (from which the target 
f is selected), optimal sampling chooses the points $x_i \in D$, $i = 1, \ldots, n$ by optimizing 
over the entire function space $\mathcal{F}$. Once these points are obtained, then they remain 
fixed irrespective of the target (and correspondingly the data set $\mathcal{D}$). Thus, if we 
wanted to sample the function at n points, and had an approximation scheme $\mathcal{H}$ with 
which we wished to recover the true target, a typical optimal recovery formulation 
would involve sampling the function at the points obtained as a result of optimizing 
the following objective function: 

$$\arg\min_{x_1, \ldots, x_n} \sup_{f \in \mathcal{F}} d(f, h(\mathcal{D} = \{(x_i, f(x_i))\}_{i=1 \ldots n})) \qquad (3.28)$$

where $h(\mathcal{D} = \{(x_i, f(x_i))\}_{i=1 \ldots n}) \in \mathcal{H}$ is the learner's hypothesis when the target is 
f and the function is sampled at the $x_i$'s. Given no knowledge of the target, these 
points are the optimal ones to sample. 

In contrast, our scheme of sampling can be conceived as an iterative application 
of optimal recovery (one point at a time) by conditioning on the data seen so far. 
Making this absolutely explicit, we start out by asking for one point using optimal 
recovery. We obtain this point by 

$$\arg\min_{x_1} \sup_{f \in \mathcal{F}} d(f, h(\mathcal{D}_1 = \{(x_1, f(x_1))\}))$$

Having sampled at this point (and obtained $y_1$ from the true target), we can now 
reduce the class of candidate target functions to $\mathcal{F}_1$, the elements of $\mathcal{F}$ which are 
consistent with the data seen so far. Now we obtain our second point by 

$$\arg\min_{x_2} \sup_{f \in \mathcal{F}_1} d(f, h(\mathcal{D}_2 = \{(x_1, y_1), (x_2, f(x_2))\}))$$

Note that the supremum is done over a restricted set $\mathcal{F}_1$ the second time. In this 
fashion, we perform optimal recovery at each stage, reducing the class of functions 
over which the supremum is performed. It should be made clear that this sequential 
optimal recovery is not a greedy technique to arrive at the solution of eq. 3.28. It 
will give us a different set of points. Further, this set of points will depend upon the 
target function. In other words, the sampling strategy adapts itself to the unknown 
target f as it gains more information about that target through the data. We know 
of no similar sequential sampling scheme in the literature. 

While classical optimal recovery has the formulation of eq. 3.28, imagine a 
situation where a "teacher", who knows both the target function and the learner, wishes to 
communicate to the learner the best set of points to minimize the error made by 
the learner. Thus given a function g, this best set of points can be obtained by the 
following optimization 

$$\arg\min_{x_1, \ldots, x_n} d(g, h(\{(x_i, g(x_i))\}_{i=1 \ldots n})) \qquad (3.29)$$

Eq. 3.28 and eq. 3.29 provide two bounds on the performance of the active learner 
following the strategy of Algorithm B in the previous section. While eq. 3.28 chooses 
optimal points without knowing anything about the target, and eq. 3.29 chooses 
optimal points knowing the target completely, the active learner chooses points 
optimally on the basis of partial information about the target (information provided by 
the data set). 

2) Concept Learning: The PAC learning community (which has traditionally 
focused on concept learning) typically incorporates activity on the part of the learner 
by means of queries that the learner can make of an oracle. Queries (Angluin, 1988) 
range from membership queries (is x an element of the target concept c) to statistical 
queries (Kearns, 1993; where the learner cannot ask for data but can ask for 
estimates of functionals of the function class) to arbitrary boolean valued queries (see 
Kulkarni et al. for an investigation of query complexity). Our form of activity can be 
considered as a natural adaptation of membership queries to the case of learning 
real-valued functions in our modified PAC model. It is worthwhile to mention relevant 
work which touches the contents of this chapter at some points. The most significant 
of these is an investigation of the sample complexity of active versus passive learning 
conducted by Eisenberg and Rivest (1990) for a simple class of unit step functions. It 
was found that a binary search algorithm could vastly outperform a passive learner 
in terms of the number of examples it needed to $(\epsilon, \delta)$-learn the target function. This 
chapter is very much in the spirit of that work, focusing as it does on the sample 
complexity question. Another interesting direction is the transformation of PAC-learning 
algorithms from a batch to online mode. While Littlestone et al. (1991) consider online 
learning of linear functions, Kimber and Long (1992) consider functions with bounded 
derivatives which we examine later in this chapter. However, the question of choosing 
one's data is not addressed at all. Kearns and Schapire (1990) consider the 
learning of p-concepts (which are essentially equivalent to learning classes of real-valued 
functions with noise) and address the learning of monotone functions in this context. 
Again, there is no active component on the part of the learner. 

3) Adaptive Integration: The novelty of our formulation lies in its adaptive nature. 
There are some similarities to work in adaptive numerical integration which are worth 
mentioning. Roughly speaking, an adaptive integration technique (Berntsen et al., 
1991) divides the domain of integration into regions over which the integration is 
done. Estimates are then obtained of the error on each of these regions. The region 
with maximum error is subdivided. Though the spirit of such an adaptive approach is 
close to ours, specific results in the field naturally differ because of differences between 
the integration problem (and its error bounds) and the approximation problem. 

4) Bayesian and other formulations: It should be noted that we have a worst-case 
formulation (the supremum in our formulation represents the maximum possible error 
the scheme might have). Alternative Bayesian schemes have been devised (MacKay, 
1991; Cohn, 1994) from the perspective of optimal experiment design (Fedorov). 
Apart from the inherently different philosophical positions of the two schemes, an 
in-depth treatment of the sample complexity question is not done. We will soon 
give two examples where we address this sample complexity question closely. In a 
separate piece of work (Sung and Niyogi, 1994), the author has also investigated such 
Bayesian formulations from an information-theoretic perspective. Yet another 
average-case formulation comes from the information-complexity viewpoint of Traub 
and Wozniakowski (see Traub et al. (1988) for details). Various interesting sampling 
strategies are suggested by research in that spirit. We do not attempt to compare 
them due to the difficulty in comparing worst-case and average-case bounds. 

5) Generating Examples and "Hints": Rather than choosing its new examples, 
the learner might generate them by virtue of having some prior knowledge of the 
learning task. For example, prior knowledge that the target function is odd would 
allow the learner to generate a new (symmetric) example: for every (x, f(x)) pair, 
the learner could add the example (-x, -f(x)) to the training set. For vision tasks, 
Poggio and Vetter (1992) use similarity transformations like rotation, translation and 
the like to generate new images from old ones. More generally, Abu-Mostafa (1993) 
has formalized the approach as learning from hints showing how arbitrary hints can 
be incorporated in the learning process. Hints induce activity on the part of the 
learner and the connection between the two is worth investigating further. 

Thus, we have motivated and derived in this section, two possible active strategies. 
The formulation is general. We now demonstrate the usefulness of such a formulation 
by considering two classes of real-valued functions as examples and deriving specific 
active algorithms from this perspective. At this stage, the important question of 
sample complexity of active versus passive learning still remains unresolved. We 
investigate this more closely by deriving theoretical bounds and performing empirical 
simulation studies in the case of the specific classes we consider. 

3.2 Example 1: A Class of Monotonically Increasing Bounded Functions 

Consider the following class of functions from the interval $[0, 1] \subset \mathbb{R}$ to $\mathbb{R}$: 

$$\mathcal{F} = \{ f : 0 \le f \le M, \text{ and } f(x) \ge f(y) \ \ \forall x \ge y \}$$

Note that the functions belonging to this class need not be continuous though they 
do need to be measurable. This class is PAC-learnable (with an $L_1(P)$ norm, in 
which case our notion of PAC reduces to the classical notion) though it has infinite 
pseudo-dimension (see footnote 15) in the sense of Pollard (1984). Thus, we observe: 

Observation 1 The class $\mathcal{F}$ has infinite pseudo-dimension (in the sense of Pollard 
(1984); Haussler (1989)). 

Proof: To have infinite pseudo-dimension, it must be the case that for every n > 0, 
there exists a set of points $\{x_1, \ldots, x_n\}$ which is shattered by the class $\mathcal{F}$. In other 
words, there must exist a fixed translation vector $t = (t_1, \ldots, t_n)$ such that for every 
boolean vector $b = (b_1, \ldots, b_n)$, there exists a function $f \in \mathcal{F}$ which satisfies 
$f(x_i) - t_i > 0 \Leftrightarrow b_i = 1$. To see that this is indeed the case, let the n points be $x_i = i/(n+1)$ 
for i going from 1 to n. Let the translation vector then be given by $t_i = x_i$. For an 
arbitrary boolean vector b we can always come up with a monotonic function such 
that $f(x_i) = i/(n+1) - 1/(3(n+1))$ if $b_i = 0$ and $f(x_i) = i/(n+1) + 1/(3(n+1))$ if 
$b_i = 1$. □ 
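
The construction in the proof can be checked mechanically for small n. The sketch below is only an illustration under stated assumptions: it takes M = 1 by default (the construction needs the prescribed values to lie in [0, M]), uses exact rational arithmetic, and verifies that for every boolean vector the prescribed values are non-decreasing, bounded, and realize the required sign pattern. A monotone function in the class through such values always exists (for example, a step function). The function name is hypothetical.

```python
from fractions import Fraction
from itertools import product


def construction_shatters(n, M=1):
    """Check the proof's construction on n points (assumes the bound M >= 1)."""
    xs = [Fraction(i, n + 1) for i in range(1, n + 1)]
    ts = list(xs)                      # translation vector t_i = x_i
    delta = Fraction(1, 3 * (n + 1))   # offset 1/(3(n+1)) used in the proof
    for bits in product((0, 1), repeat=n):
        vals = [x + delta if b else x - delta for x, b in zip(xs, bits)]
        monotone = all(u <= v for u, v in zip(vals, vals[1:]))
        bounded = all(0 <= v <= M for v in vals)
        pattern = all((v - t > 0) == (b == 1) for v, t, b in zip(vals, ts, bits))
        if not (monotone and bounded and pattern):
            return False
    return True
```

The check enumerates all 2^n boolean vectors, so it is only practical for small n; the proof itself covers every n.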

We also need to specify the terms $\mathcal{H}$, $d_C$, the procedure $\mathcal{P}$ for partitioning the 
domain D = [0, 1] and so on. For our purposes, we assume that the approximation 
scheme $\mathcal{H}$ is first order splines. This is simply finding the monotonic function which 


15 Finite pseudo-dimension is only a sufficient and not necessary condition for PAC learnability as 
this example demonstrates. 



interpolates the data in a piece-wise linear fashion. A natural way to partition the 
domain is to divide it into the intervals $[0, x_1), [x_1, x_2), \ldots, [x_i, x_{i+1}), \ldots, [x_n, 1]$. The 
metric $d_C$ is an $L_p$ metric given by $d_C(f_1, f_2) = \left(\int_C |f_1 - f_2|^p dx\right)^{1/p}$. 

Note that we are specifically interested in comparing the sample complexities of 
passive and active learning. We will do this under a uniform distributional assump- 
tion, i.e., the passive learner draws its examples by sampling the target function 
uniformly at random on its domain [0,1]. In contrast, we will show how our gen- 
eral formulation in the earlier section translates into a specific active algorithm for 
choosing points, and we derive bounds on its sample complexity. We begin by first 
providing a lower bound for the number of examples a passive PAC learner would 
need to draw to learn this class $\mathcal{F}$. 

3.2.1 Lower Bound for Passive Learning 

Theorem 3.2.1 Any passive learning algorithm (more specifically, any 
approximation scheme which draws data uniformly at random and interpolates the data by any 
arbitrary bounded function) will have to draw at least $\frac{1}{2}(M/2\epsilon)^p \ln(1/\delta)$ examples to 
P-PAC learn the class where P is a uniform distribution. 

Proof: Consider the uniform distribution on [0, 1] and a subclass of functions which 
have value 0 on the region $A = [0, 1 - (2\epsilon/M)^p]$ and belong to $\mathcal{F}$. Suppose the passive 
learner draws $l$ examples uniformly at random. Then with probability $(1 - (2\epsilon/M)^p)^l$, 
all these examples will be drawn from region A. It only remains to show that for 
the subclass considered, whatever be the function hypothesized by the learner, an 
adversary can force it to make a large error. 

Suppose the learner hypothesizes that the function is h. Let the value of 
$\left(\int_{(1-(2\epsilon/M)^p,\,1)} |h(x)|^p dx\right)^{1/p}$ be $\chi$. Obviously $0 \le \chi \le (M^p (2\epsilon/M)^p)^{1/p} = 2\epsilon$. If $\chi < \epsilon$, 
then the adversary can claim that the target function was really 

$$g(x) = \begin{cases} 0 & \text{for } x \in [0, 1 - (2\epsilon/M)^p] \\ M & \text{for } x \in (1 - (2\epsilon/M)^p, 1] \end{cases}$$

If, on the other hand, $\chi \ge \epsilon$, then the adversary can claim the function was really 
g = 0. 

In the first case, by the triangle inequality, 

$$d(h, g) = \left(\int_{[0,1]} |g - h|^p dx\right)^{1/p} \ge \left(\int_{(1-(2\epsilon/M)^p,\,1]} |g - h|^p dx\right)^{1/p}$$
$$\ge \left(\int_{(1-(2\epsilon/M)^p,\,1)} M^p dx\right)^{1/p} - \left(\int_{(1-(2\epsilon/M)^p,\,1)} |h|^p dx\right)^{1/p} = 2\epsilon - \chi > \epsilon$$

In the second case, 

$$d(h, g) = \left(\int_{[0,1]} |g - h|^p dx\right)^{1/p} \ge \left(\int_{(1-(2\epsilon/M)^p,\,1)} |0 - h|^p dx\right)^{1/p} = \chi \ge \epsilon$$

Now we need to find out how large $l$ must be so that this particular event of 
drawing all examples in A is not very likely, in particular, it has probability less than 
$\delta$. 

For $(1 - (2\epsilon/M)^p)^l$ to be greater than $\delta$, we need $l < \frac{1}{-\ln(1 - (2\epsilon/M)^p)} \ln(1/\delta)$. It is a fact 
that for $\alpha < 1/2$, $\frac{1}{2\alpha} < \frac{1}{-\ln(1 - \alpha)}$. Making use of this fact (and setting $\alpha = (2\epsilon/M)^p$), we 
see that for $\epsilon < (M/2)(1/2)^{1/p}$, we have $\frac{1}{2}(M/2\epsilon)^p \ln(1/\delta) < \frac{1}{-\ln(1 - (2\epsilon/M)^p)} \ln(1/\delta)$. So unless 
$l$ is greater than $\frac{1}{2}(M/2\epsilon)^p \ln(1/\delta)$, the probability that all examples are chosen from 
A is greater than $\delta$. Consequently, with probability greater than $\delta$, the passive learner 
is forced to make an error of at least $\epsilon$, and PAC learning cannot take place. □ 
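
To get a feel for the size of this bound, the following sketch evaluates the expression of Theorem 3.2.1 together with the probability that all of l uniform draws land in region A. It is an illustration only; the function names and default parameter values are assumptions, and the bound is only evaluated in the regime $(2\epsilon/M)^p < 1/2$ used in the proof.

```python
import math


def passive_lower_bound(eps, delta, M=1.0, p=1):
    """Evaluate (1/2) (M / (2 eps))^p ln(1/delta) from Theorem 3.2.1."""
    alpha = (2.0 * eps / M) ** p
    if not 0.0 < alpha < 0.5:
        raise ValueError("the proof requires 0 < (2*eps/M)^p < 1/2")
    return 0.5 * (M / (2.0 * eps)) ** p * math.log(1.0 / delta)


def prob_all_in_A(l, eps, M=1.0, p=1):
    """Probability that l uniform draws on [0, 1] all fall in A = [0, 1 - (2 eps/M)^p]."""
    return (1.0 - (2.0 * eps / M) ** p) ** l


# Illustrative values: eps = 0.05, delta = 0.05, M = 1, p = 1.
bound = passive_lower_bound(0.05, 0.05)
p_miss = prob_all_in_A(math.floor(bound), 0.05)
```

With any sample size at or below the bound, `prob_all_in_A` stays above delta, which is exactly the event the adversary exploits in the proof.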

3.2.2 Active Learning Algorithms 

In the previous section we computed a lower bound for passively PAC learning this 
class for a uniform distribution 16 . Here we derive an active learning strategy (the 
CLA algorithm) which would meaningfully choose new examples on the basis of in- 
formation gathered about the target from previous examples. This is a specific in- 
stantiation of the general formulation, and interestingly yields a "divide and conquer" 
binary searching algorithm starting from a different philosophical standpoint. We for- 
mally prove an upper bound on the number of examples it requires to PAC learn the 
class. While this upper bound is a worst case bound and holds for all functions in 
the class, the actual number of queries (examples) this strategy takes differs widely 
depending upon the target function. We demonstrate empirically the performance of 
this strategy for different kinds of functions in the class in order to get a feel for this 
difference. We derive a classical non-sequential optimal sampling strategy and show 
that this is equivalent to uniformly sampling the target function. Finally, we are able 
to empirically demonstrate that the active algorithm outperforms both the passive 
and uniform methods of data collection. 

Derivation of an optimal sampling strategy 

Consider an approximation scheme of the sort described earlier attempting to 
approximate a target function $f \in \mathcal{F}$ on the basis of a data set $\mathcal{D}$. Shown in fig. 3-14 



16 Naturally, this is a distribution-free lower bound as well. In other words, we have demonstrated 
the existence of a distribution for which the passive learner would have to draw at least as many 
examples as the theorem suggests. 

Figure 3-14: A depiction of the situation for an arbitrary data set. The set $\mathcal{F}_{\mathcal{D}}$ consists 
of all functions lying in the boxes and passing through the datapoints (for example, 
the dotted lines). The approximating function h is a linear interpolant shown by a 
solid line. 



is a picture of the situation. We can assume without loss of generality that we start 
out by knowing the value of the function at the points x = 0 and x = 1. The points 
$\{x_i;\ i = 1, \ldots, n\}$ divide the domain into n + 1 intervals $C_i$ (i going from 0 to n) 
where $C_i = [x_i, x_{i+1}]$ ($x_0 = 0$, $x_{n+1} = 1$). The monotonicity constraint on $\mathcal{F}$ permits us 
to obtain rectangular boxes showing the values that the target function could take 
at the points on its domain. The set of all functions which lie within these boxes as 
shown is $\mathcal{F}_{\mathcal{D}}$. 

Let us first compute $e_{C_i}(\mathcal{H}, \mathcal{D}, \mathcal{F})$ for some interval $C_i$. On this interval, the 
function is constrained to lie in the appropriate box. We can zoom in on this box as 
shown in fig. 3-15. 

The maximum error the approximation scheme could have (indicated by the 
shaded region) is clearly given by 

$$\left(\int_{C_i} |h - f(x_i)|^p dx\right)^{1/p} = \left(\int_0^B \left(\frac{A}{B} x\right)^p dx\right)^{1/p} = \frac{A B^{1/p}}{(p+1)^{1/p}}$$

where $A = f(x_{i+1}) - f(x_i)$ and $B = (x_{i+1} - x_i)$. 

Clearly the error over the entire domain $e_D$ is given by 

$$e_D^p = \sum_{i=0}^{n} e_{C_i}^p \qquad (3.30)$$



The computation of $e_C$ is all we need to implement an active strategy motivated 
by Algorithm A in section 3.1. All we need to do is sample the function in the interval 
with largest error; recall that we need a procedure $\mathcal{P}$ to determine how to sample this 
interval to obtain a new data point. We choose (arbitrarily) to sample the midpoint 




Figure 3-15: Zoomed version of interval. The maximum error the approximation 
scheme could have is indicated by the shaded region. This happens when the 
adversary claims the target function had the value $y_i$ throughout the interval. 



of the interval with the largest error yielding the following algorithm. 
The Choose and Learn Algorithm (CLA) 



1. [Initial Step] Ask for values of the function at points x = 0 and x = 1. At this 
stage, the domain [0,1] is composed of one interval only, viz., [0,1]. Compute 
$E_1 = \left(\frac{1}{p+1}\right)^{1/p} (1 - 0)^{1/p} |f(1) - f(0)|$ and $T_1 = E_1$. If $T_1 < \epsilon$, stop and output 
the linear interpolant of the samples as the hypothesis, otherwise query the 
midpoint of the interval to get a partition of the domain into two subintervals 
[0,1/2) and [1/2,1]. 

2. [General Update and Stopping Rule] In general, at the kth stage, suppose 
that our partition of the interval [0, 1] is $[x_0 = 0, x_1), [x_1, x_2), \ldots, [x_{k-1}, x_k = 1]$. 
We compute the normalized error 
$E_i = \left(\frac{1}{p+1}\right)^{1/p} (x_i - x_{i-1})^{1/p} |f(x_i) - f(x_{i-1})|$ 
for all $i = 1, \ldots, k$. The midpoint of the interval with maximum $E_i$ is queried 
for the next sample. The total normalized error $T_k = \left(\sum_{i=1}^{k} E_i^p\right)^{1/p}$ is computed 
at each stage and the process is terminated when $T_k < \epsilon$. Our hypothesis h 
at every stage is a linear interpolation of all the points sampled so far and our 
final hypothesis is obtained upon the termination of the whole process. 
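
The two steps above can be sketched in a few lines. This is a minimal illustration of the CLA as described, not the code used for the simulations reported later: the function name `cla`, the `max_queries` guard, and the tie-breaking rule (first interval with maximal error) are assumptions introduced here, and the target `f` is assumed to be monotonically non-decreasing on [0, 1].

```python
def cla(f, eps, p=1, max_queries=10000):
    """Choose and Learn: query midpoints of the interval with the largest error bound.

    Returns the sample locations, sampled values, and the final total error T_k.
    The piecewise-linear interpolant of (xs, ys) is the hypothesis.
    """
    xs = [0.0, 1.0]
    ys = [f(0.0), f(1.0)]
    c = (1.0 / (p + 1)) ** (1.0 / p)  # the constant (1/(p+1))^(1/p)

    for _ in range(max_queries):
        # Normalized error E_i on each interval [x_{i-1}, x_i].
        errs = [
            c * (xs[i + 1] - xs[i]) ** (1.0 / p) * abs(ys[i + 1] - ys[i])
            for i in range(len(xs) - 1)
        ]
        total = sum(e ** p for e in errs) ** (1.0 / p)  # T_k
        if total < eps:
            return xs, ys, total
        i = max(range(len(errs)), key=errs.__getitem__)  # interval with largest E_i
        mid = 0.5 * (xs[i] + xs[i + 1])
        xs.insert(i + 1, mid)
        ys.insert(i + 1, f(mid))
    raise RuntimeError("max_queries reached before the error bound fell below eps")


# Example target (an assumption for illustration): a unit step at x = 0.3.
step_xs, step_ys, step_T = cla(lambda x: 0.0 if x < 0.3 else 1.0, eps=0.01, p=1)
```

For the step target every query after the first lands in the interval containing the jump, so the procedure behaves like a binary search for the discontinuity; for smoother targets the queries spread out over the domain.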

Now imagine that we chose to sample at a point $x \in C_i = [x_i, x_{i+1}]$ and received 
the value $y \in \mathcal{F}_{\mathcal{D}}(x)$ (i.e., y in the box) as shown in fig. 3-16. This adds one 
more interval and divides $C_i$ into two subintervals $C_{i1}$ and $C_{i2}$ where $C_{i1} = [x_i, x]$ and 
$C_{i2} = [x, x_{i+1}]$. We also correspondingly obtain two smaller boxes inside the larger 

Figure 3-16: The situation when the interval $C_i$ is sampled yielding a new data point. 
This subdivides the interval into two subintervals and the two shaded boxes indicate 
the new constraints on the function. 



box within which the function is now constrained to lie. The uncertainty measure $e_C$ 
can be recomputed taking this into account. 

Observation 2 The addition of the new data point (x, y) does not change the 
uncertainty value on any of the other intervals. It only affects the interval $C_i$ which got 
subdivided. The total uncertainty over this interval is now given by 

$$e_{C_i}(\mathcal{H}, \mathcal{D}', \mathcal{F}) = \left(\frac{1}{p+1}\right)^{1/p} \left( (x - x_i)(y - f(x_i))^p + (x_{i+1} - x)(f(x_{i+1}) - y)^p \right)^{1/p}$$

$$= G \left( z r^p + (B - z)(A - r)^p \right)^{1/p}$$

where for convenience we have used the substitution $z = x - x_i$, $r = y - f(x_i)$, and 
$G = (1/(p+1))^{1/p}$, and A and B are $f(x_{i+1}) - f(x_i)$ and $x_{i+1} - x_i$ as above. Clearly z ranges from 0 to B 
while r ranges from 0 to A. 

We first prove the following lemma: 

Lemma 3.2.1 

$$B/2 = \arg\min_{z \in [0,B]} \sup_{r \in [0,A]} G \left( z r^p + (B - z)(A - r)^p \right)^{1/p}$$

Proof: Consider any $z \in [0, B]$. There are three cases to consider: 

Case I $z > B/2$: let $z = B/2 + \alpha$ where $\alpha > 0$. We find 

$$\sup_{r \in [0,A]} G \left( z r^p + (B - z)(A - r)^p \right)^{1/p} = G \left( \sup_{r \in [0,A]} \left( z r^p + (B - z)(A - r)^p \right) \right)^{1/p}$$

Now, 

$$\sup_{r \in [0,A]} \left( z r^p + (B - z)(A - r)^p \right) = \sup_{r \in [0,A]} \left( (B/2 + \alpha) r^p + (B/2 - \alpha)(A - r)^p \right)$$
$$= \sup_{r \in [0,A]} \left( \frac{B}{2}\left(r^p + (A - r)^p\right) + \alpha\left(r^p - (A - r)^p\right) \right)$$

Now for r = A, the expression within the supremum, $\frac{B}{2}(r^p + (A - r)^p) + \alpha(r^p - (A - r)^p)$, 
is equal to $(B/2 + \alpha)A^p$. For any other $r \in [0, A]$, we need to show that 

$$\frac{B}{2}\left(r^p + (A - r)^p\right) + \alpha\left(r^p - (A - r)^p\right) \le (B/2 + \alpha)A^p$$

or 

$$\frac{B}{2}\left((r/A)^p + (1 - r/A)^p\right) + \alpha\left((r/A)^p - (1 - r/A)^p\right) \le B/2 + \alpha$$

Putting $\beta = r/A$ (clearly $\beta \in [0, 1]$), and noticing that $(1 - \beta)^p \le 1 - \beta^p$ and 
$\beta^p - (1 - \beta)^p \le 1$, the inequality above is established. Consequently, we are able to see that 

$$\sup_{r \in [0,A]} G \left( z r^p + (B - z)(A - r)^p \right)^{1/p} = G (B/2 + \alpha)^{1/p} A$$

Case II Let $z = B/2 - \alpha$ for $\alpha > 0$. In this case, by a similar argument as above, it 
is possible to show that again, 

$$\sup_{r \in [0,A]} G \left( z r^p + (B - z)(A - r)^p \right)^{1/p} = G (B/2 + \alpha)^{1/p} A$$

Case III Finally, let $z = B/2$. Here 

$$\sup_{r \in [0,A]} G \left( z r^p + (B - z)(A - r)^p \right)^{1/p} = G (B/2)^{1/p} \sup_{r \in [0,A]} \left( r^p + (A - r)^p \right)^{1/p}$$

Clearly, then for this case, the above expression is reduced to $G A (B/2)^{1/p}$. Considering 
the three cases, the lemma is proved. □ 
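
The lemma can be sanity-checked numerically by brute force over a grid of z and r values. The sketch below is only a check under stated assumptions, not a proof: the positive constant G is dropped because it does not affect the minimizer, the grid resolutions are arbitrary, and the function names are hypothetical.

```python
def worst_case(z, A, B, p, steps=1000):
    """Approximate sup over r in [0, A] of (z r^p + (B - z)(A - r)^p)^(1/p) on a grid."""
    return max(
        (z * r ** p + (B - z) * (A - r) ** p) ** (1.0 / p)
        for r in (A * k / steps for k in range(steps + 1))
    )


def best_split(A, B, p, steps=200):
    """Grid point z in [0, B] that minimizes the worst-case uncertainty."""
    zs = [B * k / steps for k in range(steps + 1)]
    return min(zs, key=lambda z: worst_case(z, A, B, p))
```

For example, `best_split(2.0, 1.0, 2)` searches the interval of width B = 1 with rise A = 2 under the L2 metric; the lemma predicts the midpoint z = B/2 with worst-case value $A (B/2)^{1/p}$ (up to the factor G).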

The above lemma in conjunction with eq. 3.30 and observation 2 proves that if we 
choose to sample a particular interval $C_i$ then sampling the midpoint is the optimal 
thing to do. In particular, we see that 

$$\min_{x \in [x_i, x_{i+1}]} \sup_{y \in [f(x_i), f(x_{i+1})]} e_{C_i}(\mathcal{H}, \mathcal{D} \cup (x, y), \mathcal{F}) = \left(\frac{1}{p+1}\right)^{1/p} \left(\frac{x_{i+1} - x_i}{2}\right)^{1/p} (f(x_{i+1}) - f(x_i)) = e_{C_i}(\mathcal{H}, \mathcal{D}, \mathcal{F}) / 2^{1/p}$$

In other words, if the learner were constrained to pick its next sample in the interval 
$C_i$, then by sampling the midpoint of this interval $C_i$, the learner ensures that the 
maximum error it could be forced to make by a malicious adversary is minimized. In 
particular, if the uncertainty over the interval $C_i$ with its current data set $\mathcal{D}$ is $e_{C_i}$, 
the uncertainty over this region will be reduced after sampling its midpoint and can 
have a maximum value of $e_{C_i} / 2^{1/p}$. 

Now which interval must the learner sample to minimize the maximum possible 
uncertainty over the entire domain D = [0,1]? Note that if the learner chose to 
sample the interval $C_i$ then 

$$\min_{x \in C_i = [x_i, x_{i+1}]} \sup_{y \in \mathcal{F}_{\mathcal{D}}(x)} e_{D=[0,1]}(\mathcal{H}, \mathcal{D} \cup (x, y), \mathcal{F}) = \left( \sum_{j=0, j \ne i}^{n} e_{C_j}^p(\mathcal{H}, \mathcal{D}, \mathcal{F}) + \frac{e_{C_i}^p(\mathcal{H}, \mathcal{D}, \mathcal{F})}{2} \right)^{1/p}$$

From the decomposition above, it is clear that the optimal point to sample according 
to the principle embodied in Algorithm B is the midpoint of the interval $C_j$ which 
has the maximum uncertainty $e_{C_j}(\mathcal{H}, \mathcal{D}, \mathcal{F})$ on the basis of the data seen so far, i.e., 
the data set $\mathcal{D}$. Thus we can state the following theorem 

Theorem 3.2.2 The CLA is the optimal algorithm for the class of monotonic 
functions. 

Having thus established that our binary searching algorithm (CLA) is optimal, 
we now turn our efforts to determining the number of examples the CLA would need 
in order to learn the unknown target function to $\epsilon$ accuracy with $\delta$ confidence. In 
particular, we can prove the following theorem. 

Theorem 3.2.3 The CLA converges in at most $(M/\epsilon)^p$ steps. Specifically, after 
collecting at most $(M/\epsilon)^p$ examples, its hypothesis is $\epsilon$ close to the target with probability 
1. 

Proof Sketch: The proof of convergence for this algorithm is a little tedious. However,
to convince the reader, we provide the proof of convergence for a slight variant
of the active algorithm. It is possible to show (not shown here) that the convergence
time for the active algorithm described earlier is bounded by the convergence time
for the variant. First, consider a uniform grid of points $(\epsilon/M)^p$ apart on the domain
[0, 1]. Now imagine that the active learner works just as described earlier but with a
slight twist, viz., it can only query points on this grid. Thus at the kth stage, instead
of querying the true midpoint of the interval with largest uncertainty, it will query
the gridpoint closest to this midpoint. Obviously the intervals at the kth stage are
also separated by points on the grid (i.e. previous queries). If it is the case that the
learner has queried all the points on the grid, then the maximum possible error it
could make is at most $\epsilon$. To see this, let $\alpha = (\epsilon/M)^p$ be the grid spacing and let us
first look at a specific small interval $[k\alpha, (k+1)\alpha]$. We know the following to be
true for this subinterval:

$$f(k\alpha) = h(k\alpha) \le f(x),\ h(x) \le f((k+1)\alpha) = h((k+1)\alpha)$$

Thus

$$|f(x) - h(x)| \le f((k+1)\alpha) - f(k\alpha)$$

and so over the interval $[k\alpha, (k+1)\alpha]$

$$\int_{k\alpha}^{(k+1)\alpha} |f(x) - h(x)|^p\,dx \le \int_{k\alpha}^{(k+1)\alpha} \left(f((k+1)\alpha) - f(k\alpha)\right)^p dx = \left(f((k+1)\alpha) - f(k\alpha)\right)^p \alpha$$

It follows that

$$\int_{[0,1]} |f - h|^p\,dx = \int_{[0,\alpha)} |f - h|^p\,dx + \dots + \int_{[1-\alpha,1]} |f - h|^p\,dx$$
$$\le \alpha\left((f(\alpha) - f(0))^p + (f(2\alpha) - f(\alpha))^p + \dots + (f(1) - f(1-\alpha))^p\right)$$
$$\le \alpha\left(f(\alpha) - f(0) + f(2\alpha) - f(\alpha) + \dots + f(1) - f(1-\alpha)\right)^p$$
$$= \alpha\,(f(1) - f(0))^p \le \alpha M^p$$

So if $\alpha = (\epsilon/M)^p$, we see that the $L_p$ error would be at most
$\left(\int_{[0,1]} |f - h|^p\,dx\right)^{1/p} \le \epsilon$.
Thus the active learner moves from stage to stage collecting examples at the grid
points. It could converge at any stage, but clearly after it has seen the value of the
unknown target at all the gridpoints, its error is provably at most $\epsilon$ and consequently
it must stop by this time. □

Figure 3-17: How the CLA chooses its examples. Vertical lines have been drawn to
mark the x-coordinates of the points at which the algorithm asks for the value of the
function.
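
The grid argument in the proof sketch can be checked numerically. The following is a minimal sketch, not part of the original analysis: the target function, the values of M, $\epsilon$ and p, and the helper name `lp_error_on_grid` are all illustrative assumptions. It samples a monotone function on a grid of spacing $(\epsilon/M)^p$, forms the linear interpolant, and compares the resulting $L_p$ error with the bound $(\alpha (f(1) - f(0))^p)^{1/p}$ from the proof.

```python
import math

def lp_error_on_grid(f, alpha, p, cells=200):
    """Approximate L_p distance between f and its linear interpolant on a grid of spacing alpha."""
    n = round(1.0 / alpha)                      # number of grid intervals (assumes 1/alpha is an integer)
    total = 0.0
    for k in range(n):
        lo, hi = k * alpha, (k + 1) * alpha
        flo, fhi = f(lo), f(hi)
        for j in range(cells):                  # midpoint rule inside each grid interval
            x = lo + (j + 0.5) * alpha / cells
            h = flo + (fhi - flo) * (x - lo) / alpha
            total += abs(f(x) - h) ** p * alpha / cells
    return total ** (1.0 / p)

M, eps, p = 1.0, 0.25, 2
alpha = (eps / M) ** p                          # grid spacing used in the proof sketch

def f(x):
    # hypothetical monotone target whose values stay within [0, M]
    return M / (1.0 + math.exp(-40.0 * (x - 0.3)))

err = lp_error_on_grid(f, alpha, p)
bound = (alpha * (f(1.0) - f(0.0)) ** p) ** (1.0 / p)
print(err, bound, eps)
```

The observed error is typically far below the bound, which is a worst-case statement over the whole class.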



3.2.3 Empirical Simulations, and other Investigations 

Our aim here is to characterize the performance of CLA as an active learning strat- 
egy. Remember that CLA is an adaptive example choosing strategy and the number 
of samples it would take to converge depends upon the specific nature of the target 
function. We have already computed an upper bound on the number of samples it 
would take to converge in the worst case. In this section we try to provide some 
intuition as to how this sampling strategy differs from random draw of points (equiv- 
alent to passive learning) or drawing points on a uniform grid (equivalent to optimal 
recovery following eq. 3.28 as we shall see shortly). We perform simulations on ar- 
bitrary monotonic increasing functions to better characterize conditions under which 
the active strategy could outperform both a passive learner as well as a uniform 
learner. 

Distribution of Points Selected 

As has been mentioned earlier, the points selected by CLA depend upon the specific 
target function. Shown in fig. 3-5 is the performance of the algorithm for an arbitrarily 
constructed monotonically increasing function. Notice the manner in which it chooses 
its examples. Informally speaking, in regions where the function changes a lot (such
regions can be considered to have high information density and are consequently more
"interesting"), CLA samples densely. In regions where the function doesn't change
much (correspondingly low information density), it samples sparsely. As a matter of
fact, the density of the points seems to follow the derivative of the target function as
shown in fig. 3-18.

Figure 3-18: The dotted line shows the density of the samples along the x-axis when
the target was the monotone function of the previous example. The bold line is a
plot of the derivative of the function. Notice the correlation between the two.

Consequently, we conjecture that 

Conjecture 1 The density of points sampled by the active learning algorithm is pro- 
portional to the derivative of the function at that point for differentiable functions. 
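
The sampling behaviour behind this conjecture can be reproduced with a short sketch. It assumes the $L_1$ case, where the uncertainty of an interval is half its width times its rise, so the interval to bisect is the one maximizing width times rise. The function names and the smooth step target are illustrative choices, not taken from the original simulations.

```python
import math

def cla_monotone(f, n_queries):
    """Greedy midpoint sampling for a monotone increasing target under the L1 metric."""
    xs = [0.0, 1.0]
    ys = [f(0.0), f(1.0)]
    for _ in range(n_queries):
        # L1 uncertainty of interval k is (1/2) * width * rise; the 1/2 does not change the argmax
        i = max(range(len(xs) - 1),
                key=lambda k: (xs[k + 1] - xs[k]) * (ys[k + 1] - ys[k]))
        mid = 0.5 * (xs[i] + xs[i + 1])
        xs.insert(i + 1, mid)
        ys.insert(i + 1, f(mid))
    return xs

def target(x):
    # hypothetical smooth step: nearly all of the change happens around x = 0.5
    return 1.0 / (1.0 + math.exp(-50.0 * (x - 0.5)))

xs = cla_monotone(target, 30)
steep = sum(1 for x in xs if 0.4 <= x <= 0.6)   # samples where the derivative is large
flat = sum(1 for x in xs if x < 0.2 or x > 0.8)  # samples where the target barely changes
print(steep, flat)
```

Most queries land in the narrow region where the target rises, while the flat regions keep little more than the two endpoints.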

Remarks: 

1. The CLA seems to sample functions according to their rate of change over the
different regions. We have remarked earlier that the best possible sampling
strategy would be obtained by eq. 3.29. This corresponds to a teacher
(who knows the target function and the learner) selecting points for the learner.
How does the CLA sampling strategy differ from the best possible one? Does
the sampling strategy converge to the best possible one as the data goes to
infinity? In other words, does the CLA discover the best strategy? These are
interesting questions. We do not know the answer.

2. We remarked earlier that another bound on the performance of the active strategy
was that provided by the classical optimal recovery formulation of eq. 3.28.
This, as we shall show in the next section, is equivalent to uniform sampling.
We remind the reader that a crucial difference between uniform sampling and
CLA lies in the fact that CLA is an adaptive strategy and for some functions
might actually learn with very few examples. We will explore this difference
soon.

Figure 3-19: The situation when a function $f \in \mathcal{F}$ is picked, n sample points (the
x's) are chosen and the corresponding y values are obtained. Each choice of sample
points corresponds to a choice of the a's. Each choice of a function corresponds to a
choice of the b's.

Classical Optimal Recovery 

For an $L_1$ error criterion, classical optimal recovery as given by eq. 3.28 yields a
uniform sampling strategy. To see this, imagine that we chose to sample the function
at points $x_i$, $i = 1, \dots, n$. Pick a possible target function $f$ and let $y_i = f(x_i)$ for each
$i$. We then get the situation depicted in fig. 3-19. The $n$ points divide the domain into
$n + 1$ intervals. Let these intervals have length $a_i$ each as shown. Further, if $[x_{i-1}, x_i]$
corresponds to the interval of length $a_i$, then let $y_i - y_{i-1} = b_i$. In other words we
would get $n + 1$ rectangles with sides $a_i$ and $b_i$ as shown in the figure.

It is clear that choosing a vector $b = (b_1, \dots, b_{n+1})'$ with the constraint that
$\sum_{i=1}^{n+1} b_i = M$ and $b_i \ge 0$ is equivalent to defining a set of y values (in other words,
a data set) which can be generated by some function in the class $\mathcal{F}$. Specifically, the
data values at the respective sample points would be given by $y_1 = b_1$, $y_2 = b_1 + b_2$
and so on. We can define $\mathcal{F}_b$ to be the set of monotonic functions in $\mathcal{F}$ which are
consistent with these data points. In fact, every $f \in \mathcal{F}$ would map onto some $b$, and
thus belong to some $\mathcal{F}_b$. Consequently,

$$\mathcal{F} = \bigcup_{\{b :\, b_i \ge 0,\ \sum b_i = M\}} \mathcal{F}_b$$

Given a target function $f \in \mathcal{F}_b$, and a choice of $n$ points $x_i$, one can construct
the data set $\mathcal{D} = \{(x_i, f(x_i))\}_{i=1 \dots n}$ and the approximation scheme generates
an approximating function $h(\mathcal{D})$. It should be clear that for an $L_1$ distance metric
($d(f, h) = \int_0^1 |f - h|\,dx$), the following is true:

$$\sup_{f \in \mathcal{F}_b} d(f, h) = \frac{1}{2} \sum_{i=1}^{n+1} a_i b_i = \frac{1}{2}\, a \cdot b$$

Thus, taking the supremum over the entire class of functions is equivalent to

$$\sup_{f \in \mathcal{F}} d(f, h(\mathcal{D})) = \sup_{\{b :\, b_i \ge 0,\ \sum b_i = M\}} \frac{1}{2}\, a \cdot b$$

The above is a straightforward linear programming problem and yields as its solution
the result $\frac{1}{2} M \max\{a_i,\ i = 1, \dots, (n+1)\}$.

Finally, every choice of $n$ points $x_i$, $i = 1, \dots, n$ results in a corresponding vector
$a$ where $a_i \ge 0$ and $\sum a_i = 1$. Thus minimizing the maximum error over all choices
of sample points (according to eq. 3.28) is equivalent to

$$\arg\min_{x_1, \dots, x_n} \sup_{f \in \mathcal{F}} d\left(f, h(\mathcal{D} = \{(x_i, f(x_i))\}_{i=1 \dots n})\right) = \arg\min_{\{a :\, a_i \ge 0,\ \sum a_i = 1\}} \max\{a_i;\ i = 1 \dots n+1\}$$

Clearly the solution of the above problem is $a_i = \frac{1}{n+1}$ for each $i$.
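
As a small illustration of this argument, the sketch below evaluates the worst-case $L_1$ error $\frac{1}{2} M \max_i a_i$ for two sampling designs. The helper name `worst_case_l1` and the two example designs are assumptions made for the illustration only.

```python
def worst_case_l1(xs, M):
    """Worst-case L1 error (M / 2) * max_i a_i for sample points xs in (0, 1)."""
    pts = [0.0] + sorted(xs) + [1.0]
    gaps = [b - a for a, b in zip(pts, pts[1:])]   # interval lengths a_1, ..., a_{n+1}
    # The linear program sup_b (1/2) a.b with b_i >= 0 and sum(b) = M
    # places all of the mass M on the largest interval.
    return 0.5 * M * max(gaps)

n, M = 4, 1.0
uniform = [(i + 1) / (n + 1) for i in range(n)]    # every a_i equals 1 / (n + 1)
skewed = [0.1, 0.2, 0.3, 0.4]                      # hypothetical non-uniform design
print(worst_case_l1(uniform, M))
print(worst_case_l1(skewed, M))
```

Any design that leaves one long interval pays for it in the worst case, which is why equal spacing is the non-adaptive optimum.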

In other words, classical optimal recovery suggests that one should sample the
function uniformly. Note that this is not an adaptive scheme. In the next section, we
compare empirically the performance of three different sampling schemes: the passive,
where one samples randomly; the non-sequential "optimal", where one samples
uniformly; and the active, which follows our sequentially optimal strategy.

Error Rates and Sample Complexities for some Arbitrary Functions: Some 
Simulations 

In this section, we attempt to relate the number of examples drawn and error made 
by the learner for a variety of arbitrary monotone increasing functions. We begin 
with the following simulation: 
Simulation A:

[Figure panel: Monotone function-4 (x)]

Function No.    Average Random/CLA    Average Uniform/CLA
1               7.23                  1.66
2               61.37                 10.91
3               6.67                  1.10
4               8.07                  1.62
5               6.62                  1.56


Table 3.1: The average error rate of the random sampling and the uniform sampling
strategies, expressed as a multiple of the error rate due to CLA.
Thus for function 3, for example, uniform error rates are on average 1.1 times
CLA error rates. The averages are taken over the different values of N (number of
examples) for which the simulations have been done. Note that this is not a very
meaningful average, as the difference in the error rates between the various strategies
grows with N (as can be seen from the curves) if there is a difference in the order of
the sample complexity. However, they have been provided just to give a feel for the
numbers.

are shown in Fig. 3-22 and Table 3.2.3. Notice that the active strategy (CLA) far 
outperforms the passive strategy and clearly has the best error performance. The 
comparison between uniform sampling and active sampling is more interesting. For 
functions like function-2 (which is a smooth approximation of a step function), where 
most of the "information" is located in a small region of the domain, CLA outperforms 
the uniform learner by a large amount. Functions like function-3 which don't have 
any clearly identified region of greater information have the least difference between 
CLA and the uniform learner (as also between the passive and active learner). Finally 
on functions which lie in between these two extremes (like functions 4 and 5) we see 
decreased error-rates due to CLA which are in between the two extremes. 

In conclusion, the active learner outperforms the passive learner. Further, it is 
even better than classical optimal recovery. The significant advantage of the active 
learner lies in its adaptive nature. Thus, for certain "easy" functions, it might con- 
verge very rapidly. For others, it might take as long as classical optimal recovery, 
though never more. 
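
A comparison of this kind can be sketched in a few lines. The code below is an illustrative reconstruction, not the original simulation: the smooth step target, the sample size, the random seed and all function names are assumptions. It measures the $L_1$ error of the linear interpolant for random, uniform and CLA-selected sample points, using the $L_1$ uncertainty (width times rise) to drive the active learner.

```python
import bisect
import math
import random

def l1_error(f, xs, grid=20000):
    """Approximate L1 distance on [0, 1] between f and the linear interpolant through xs."""
    xs = sorted(xs)
    ys = [f(x) for x in xs]
    total = 0.0
    for j in range(grid):
        x = (j + 0.5) / grid
        i = bisect.bisect_right(xs, x)
        if i == 0:
            h = ys[0]                      # left of the first sample: hold its value
        elif i == len(xs):
            h = ys[-1]                     # right of the last sample: hold its value
        else:
            t = (x - xs[i - 1]) / (xs[i] - xs[i - 1])
            h = ys[i - 1] + t * (ys[i] - ys[i - 1])
        total += abs(f(x) - h)
    return total / grid

def cla_points(f, n):
    """n sample points chosen by greedy midpoint queries (L1 uncertainty = width * rise / 2)."""
    xs, ys = [0.0, 1.0], [f(0.0), f(1.0)]
    while len(xs) < n:
        i = max(range(len(xs) - 1),
                key=lambda k: (xs[k + 1] - xs[k]) * (ys[k + 1] - ys[k]))
        mid = 0.5 * (xs[i] + xs[i + 1])
        xs.insert(i + 1, mid)
        ys.insert(i + 1, f(mid))
    return xs

def step_like(x):
    # hypothetical target resembling a smooth approximation of a step function
    return 1.0 / (1.0 + math.exp(-50.0 * (x - 0.5)))

n = 17
rng = random.Random(0)                     # fixed seed so the passive run is repeatable
errors = {
    "random": l1_error(step_like, [rng.random() for _ in range(n)]),
    "uniform": l1_error(step_like, [i / (n - 1) for i in range(n)]),
    "cla": l1_error(step_like, cla_points(step_like, n)),
}
print(errors)
```

For a target of this shape the active learner spends its queries inside the transition region, so its error is several times smaller than that of the uniform grid with the same number of points.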

3.3 Example 2: A Class of Functions with Bounded 
First Derivative 

The functions we consider here map [0, 1] to R and form the class

$$\mathcal{F} = \left\{ f \;\middle|\; f(x) \text{ is differentiable and } \left|\frac{df}{dx}\right| \le d \right\}$$

[Figure: error-rate plots with panels "Error Rate Fn. 2", "Error Rate Fn. 3",
"Error Rate Fn. 4" and "Error Rate Fn. 5".]



Notice a few things about this class. First, there is no direct bound on the values
that functions in $\mathcal{F}$ can take. In other words, for every $M > 0$, there exists some
function $f \in \mathcal{F}$ such that $f(x) > M$ for some $x \in [0, 1]$. However, there is a bound
on the first derivative, which means that a particular function belonging to $\mathcal{F}$ cannot
itself change very sharply. Knowing the value of the function at any point, we can
bound the value of the function at all other points. So for example, for every $f \in \mathcal{F}$,
we see that $|f(x)| \le dx + |f(0)| \le d + |f(0)|$.

We observe that this class too has infinite pseudo-dimension. We state this without
proof.

Observation 3 The class $\mathcal{F}$ has infinite pseudo-dimension in the sense of Pollard.

As in the previous example we would like to investigate the possibility of devising
active learning strategies for this class. We first provide a lower bound on the number
of examples a learner (whether passive or active) would need in order to $\epsilon$-identify this
class. We then derive in the next section an optimal active learning strategy (that is,
an instantiation of the Active Algorithm B earlier). We also provide an upper bound
on the number of examples this active algorithm would take.

We also need to specify some other terms for this class of functions. The approximation
scheme $\mathcal{H}$ is a first order spline as before, the domain $D = [0, 1]$ is partitioned
into intervals $[x_i, x_{i+1}]$ by the data (again as before) and the metric $d_C$ is an $L_1$ metric
given by $d_C(f_1, f_2) = \int_C |f_1(x) - f_2(x)|\,dx$. The results in this section can be extended
to an $L_p$ norm but we confine ourselves to an $L_1$ metric for simplicity of presentation.

3.3.1 Lower Bounds 

Theorem 3.3.1 Any learning algorithm (whether passive or active) has to draw at
least $\Omega(d/\epsilon)$ examples (whether randomly or by choosing) in order to PAC learn the
class $\mathcal{F}$.

Proof Sketch: Let us assume that the learner collects m examples (passively by
drawing according to some distribution, or actively by any other means). Now we
show that an adversary can force the learner to make an error of at least $\epsilon$ if it draws
fewer than $\Omega(d/\epsilon)$ examples. This is how the adversary functions.

At each of the m points which are collected by the learner, the adversary claims
the function has value 0. Thus the learner is reduced to coming up with a hypothesis
that belongs to $\mathcal{F}$ and which it claims will be within $\epsilon$ of the target function.
Now we need to show that whatever the function hypothesized by the learner, the
adversary can always come up with some other function, also belonging to $\mathcal{F}$, and
agreeing with all the data points, which is more than an $\epsilon$ distance away from the
learner's hypothesis. In this way, the learner will be forced to make an error greater
than $\epsilon$.

The m points drawn by the learner divide the region [0, 1] into (at most) m + 1
different intervals. Let the lengths of these intervals be $b_1, b_2, b_3, \dots, b_{m+1}$. The "true"
function, or in other words, the function the adversary will present, should have value
0 at the endpoints of each of the above intervals. We first state the following lemma.

Lemma 3.3.1 There exists a function $f \in \mathcal{F}$ such that f interpolates the data and

$$\int_{[0,1]} |f|\,dx > \frac{kd}{4(m+1)}$$

where k is a constant arbitrarily close to 1.

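Proof: Consider fig. 3-23. The function f is indicated by the dark line. As is shown,
f changes sign at each $x = x_i$. Without loss of generality, we consider an interval
$[x_i, x_{i+1}]$ of length $b_i$. Let the midpoint of this interval be $z = (x_i + x_{i+1})/2$. The
function here has the values

$$f(x) = \begin{cases} d(x - x_i) & \text{for } x \in [x_i, z - \alpha] \\ -d(x - x_{i+1}) & \text{for } x \in [z + \alpha, x_{i+1}] \\ -\dfrac{d(x - z)^2}{2\alpha} + \dfrac{d(b_i - \alpha)}{2} & \text{for } x \in [z - \alpha, z + \alpha] \end{cases}$$

Simple algebra shows that

$$\int_{x_i}^{x_{i+1}} |f|\,dx > d\left(\frac{b_i - 2\alpha}{2}\right)^2 + 2\alpha d\left(\frac{b_i - 2\alpha}{2}\right) = \frac{d(b_i^2 - 4\alpha^2)}{4}$$

Clearly, $\alpha$ can be chosen small, so that

$$\int_{x_i}^{x_{i+1}} |f|\,dx > \frac{k d b_i^2}{4}$$

where k is as close to 1 as we want. By combining the different pieces of the function
we see that

$$\int_0^1 |f|\,dx > \frac{kd}{4} \sum_{i=1}^{m+1} b_i^2$$
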

Now we make use of the following lemma,

Lemma 3.3.2 For a set of numbers $b_1, \dots, b_m$ such that $b_1 + b_2 + \dots + b_m = 1$, the
following inequality is true

$$b_1^2 + b_2^2 + \dots + b_m^2 \ge 1/m$$

Proof: By induction. □

Now it is easy to see how the adversary functions. Suppose the learner postulates
that the true function is h. Let $\int_{[0,1]} |h|\,dx = \chi$. If $\chi > \epsilon$, the adversary claims that
the true function was $f = 0$. In that case $\int_0^1 |h - f|\,dx = \chi > \epsilon$. If on the other hand,
$\chi \le \epsilon$, then the adversary claims that the true function was f (as above). In that
case,

$$\int_0^1 |f - h|\,dx \ge \int_0^1 |f|\,dx - \int_0^1 |h|\,dx > \frac{kd}{4(m+1)} - \chi$$

Clearly, if m is less than $\frac{kd}{8\epsilon} - 1$, the learner is forced again to make an error greater
than $\epsilon$. Thus in either case, the learner is forced to make an error greater than or
equal to $\epsilon$ if fewer than $\Omega(d/\epsilon)$ examples are collected (howsoever these examples are
collected). □
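
The adversary's construction is easy to evaluate numerically. The sketch below uses the limiting sawtooth (the case $\alpha \to 0$, $k \to 1$ of the lemma), whose $L_1$ norm on an interval of length b is $d b^2 / 4$; the function name and the two example query sets are assumptions chosen for illustration.

```python
def adversary_l1_mass(sample_points, d):
    """L1 norm of the limiting sawtooth that vanishes at 0, 1 and every sample point.

    On an interval of length b, the tent with slopes +d and -d has area d * b**2 / 4;
    the smoothed function of the lemma loses only an arbitrarily small amount near the peak.
    """
    pts = [0.0] + sorted(sample_points) + [1.0]
    return sum(d * (b - a) ** 2 / 4.0 for a, b in zip(pts, pts[1:]))

d, m = 8.0, 9
uniform = [(i + 1) / (m + 1) for i in range(m)]     # equally spaced queries
clustered = [0.05 * (i + 1) for i in range(m)]      # hypothetical learner that queries only near 0
lower = d / (4 * (m + 1))                           # the bound d / (4 (m + 1)) with k = 1
print(adversary_l1_mass(uniform, d), lower)
print(adversary_l1_mass(clustered, d))
```

Equal spacing attains the bound exactly, and any other placement of the m queries leaves the adversary more room, in line with Lemma 3.3.2.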

The previous result holds for all learning algorithms. It is possible to show the 
following result for a passive learner. 



Theorem 3.3.2 A passive learner must draw at least $\max\left(\Omega(d/\epsilon),\ \sqrt{d/\epsilon}\,\ln(1/\delta)\right)$
examples to learn this class.

Proof Sketch: The $d/\epsilon$ term in the lower bound follows directly from the previous
theorem. We show how the second term is obtained.

Consider the uniform distribution on [0, 1] and a subclass of functions which have
value 0 on the region $A = [0, 1 - \alpha]$ and belong to $\mathcal{F}$. Suppose the passive learner
draws $l$ examples uniformly at random. Then with probability $(1 - \alpha)^l$, all these
examples will be drawn from region A. It only remains to show that for this event,
and the subclass considered, whatever be the function hypothesized by the learner,
an adversary can force it to make a large error.

It is easy to show (using the arguments of the earlier theorem) that there exists
a function $f \in \mathcal{F}$ such that f is 0 on A and $\int_{1-\alpha}^{1} |f|\,dx = \frac{1}{2}\alpha^2 d$. This is equal to $2\epsilon$
if $\alpha = \sqrt{4\epsilon/d}$. Now let the learner's hypothesis be h. Let $\int_{1-\alpha}^{1} |h|\,dx = \chi$. If $\chi$ is
greater than $\epsilon$, the adversary claims the target was $g = 0$. Otherwise, the adversary
claims the target was $g = f$. In either case, $\int |g - h|\,dx \ge \epsilon$.

It is possible to show (by an identical argument to the proof of theorem 1), that
unless $l > \frac{1}{4}\sqrt{d/\epsilon}\,\ln(1/\delta)$, all examples will be drawn from A with probability greater
than $\delta$ and the learner will be forced to make an error greater than $\epsilon$. Thus the second
term appears, indicating the dependence on $\delta$ in the lower bound. □

3.3.2 Active Learning Algorithms

We now derive in this section an algorithm which actively selects new examples on 
the basis of information gathered from previous examples. This illustrates how our 
formulation of section 3.1.1 can be used in this case to effectively obtain an optimal 
adaptive sampling strategy. 

Derivation of an optimal sampling strategy 

Fig. 3-24 shows an arbitrary data set containing information about some unknown
target function. Since the target is known to have a first derivative bounded by d, it is
clear that the target is constrained to lie within the parallelograms shown in the figure.
The slopes of the lines making up the parallelogram are d and −d appropriately. Thus,
$\mathcal{F}_{\mathcal{D}}$ consists of all functions which lie within the parallelograms and interpolate the
data set. We can now compute the uncertainty of the approximation scheme over
any interval, C (given by $e_C(\mathcal{H}, \mathcal{D}, \mathcal{F})$), for this case. Recall that the approximation
scheme $\mathcal{H}$ is a first order spline, and the data $\mathcal{D}$ consists of (x, y) pairs. Fig. 3-25
shows the situation for a particular interval ($C_i = [x_i, x_{i+1}]$). Here i ranges from 0 to
n. As in the previous example, we let $x_0 = 0$, and $x_{n+1} = 1$.

The maximum error the approximation scheme $\mathcal{H}$ could have on this interval is
given by (half the area of the parallelogram)

$$e_{C_i}(\mathcal{H}, \mathcal{D}, \mathcal{F}) = \sup_{f \in \mathcal{F}_{\mathcal{D}}} \int_{C_i} |h - f|\,dx = \frac{d^2 B_i^2 - A_i^2}{4d}$$

where $A_i = |f(x_{i+1}) - f(x_i)|$ and $B_i = x_{i+1} - x_i$. Clearly, the maximum error the
approximation scheme could have over the entire domain is given by

$$e_{D=[0,1]}(\mathcal{H}, \mathcal{D}, \mathcal{F}) = \sup_{f \in \mathcal{F}_{\mathcal{D}}} \sum_{j=0}^{n} \int_{C_j} |f - h|\,dx = \sum_{j=0}^{n} e_{C_j} \qquad (3.31)$$

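
The "half the area of the parallelogram" value can be derived in a few lines. The following worked computation is a sketch under the assumption that $y_{i+1} \ge y_i$ (the other case is symmetric); the offset $t$ and the apex $t^*$ are notation introduced here for the derivation only.

```latex
% Assumption: y_{i+1} >= y_i, so A_i = y_{i+1} - y_i with 0 <= A_i <= d B_i.
% t denotes the horizontal offset from x_i; the chord is the learner's linear interpolant.
\begin{align*}
\text{the upper boundary lines } & y_i + d\,t \ \text{ and } \ y_{i+1} + d\,(B_i - t)
   \ \text{ meet at } \ t^* = \frac{dB_i + A_i}{2d}, \\
\text{height above the chord at } t^*: \quad & d\,t^* - \frac{A_i}{B_i}\,t^*
   = \frac{(dB_i + A_i)(dB_i - A_i)}{2dB_i}, \\
\text{area between chord and upper boundary}: \quad & \frac{1}{2}\,B_i \cdot \frac{d^2B_i^2 - A_i^2}{2dB_i}
   = \frac{d^2B_i^2 - A_i^2}{4d}.
\end{align*}
```

The region between the chord and the lower boundary has the same area, so this value is exactly half the area of the parallelogram.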

The computation of $e_C$ is crucial to the derivation of the active sampling strategy.
Now imagine that we chose to sample at a point x in the interval $C_i$ and received
a value y (belonging to $\mathcal{F}_{\mathcal{D}}(x)$). This adds one more interval and divides $C_i$ into
two intervals $C_{i1}$ and $C_{i2}$ as shown in fig. 3-26. We also obtain two correspondingly
smaller parallelograms within which the target function is now constrained to lie.

The addition of this new data point to the data set ($\mathcal{D}' = \mathcal{D} \cup (x, y)$) requires us to
recompute the learner's hypothesis (denoted by $h'$ in fig. 3-26). Correspondingly,
it also requires us to update $e_C$, i.e., we now need to compute $e_C(\mathcal{H}, \mathcal{D}', \mathcal{F})$. First
we observe that the addition of the new data point does not affect the uncertainty
measure on any interval other than the divided interval $C_i$. This is clear when we
notice that the parallelograms (whose area denotes the uncertainty on each interval)
for all the other intervals are unaffected by the new data point.
Thus,

$$e_{C_j}(\mathcal{H}, \mathcal{D}', \mathcal{F}) = e_{C_j}(\mathcal{H}, \mathcal{D}, \mathcal{F}) = \frac{1}{4d}\left(d^2 B_j^2 - A_j^2\right) \quad \text{for } j \neq i$$

For the ith interval $C_i$, the total uncertainty is now recomputed as (half the sum of
the two parallelograms in fig. 3-26)

$$e_{C_i}(\mathcal{H}, \mathcal{D}', \mathcal{F}) = \frac{1}{4d}\left((d^2 u^2 - v^2) + (d^2 (B_i - u)^2 - (A_i - v)^2)\right)$$
$$= \frac{1}{4d}\left((d^2 u^2 + d^2 (B_i - u)^2) - (v^2 + (A_i - v)^2)\right) \qquad (3.32)$$

where $u = x - x_i$, $v = y - y_i$, and $A_i$ and $B_i$ are as before. Note that u ranges
from 0 to $B_i$, for $x_i \le x \le x_{i+1}$. However, given a particular choice of x (this fixes
a value of u), the possible values v can take are constrained by the geometry of the
parallelogram. In particular, v can only lie within the parallelogram. For a particular
x, we know that $\mathcal{F}_{\mathcal{D}}(x)$ represents the set of all possible y values we can receive. Since
$v = y - y_i$, it is clear that $v \in \mathcal{F}_{\mathcal{D}}(x) - y_i$. Naturally, if $y < y_i$, we find that $v < 0$,
and $A_i - v > A_i$. Similarly, if $y > y_{i+1}$, we find that $v > A_i$.
We now prove the following lemma:
We now prove the following lemma: 

Lemma 3.3.3 The following two identities are valid for the appropriate mini-max
problem.

(1) $\dfrac{B}{2} = \arg\min_{u \in [0,B]} \sup_{v \in \{\mathcal{F}_{\mathcal{D}}(x) - y_i\}} \left((d^2 u^2 + d^2 (B - u)^2) - (v^2 + (A - v)^2)\right)$

(2) $\dfrac{1}{2}(d^2 B^2 - A^2) = \min_{u \in [0,B]} \sup_{v \in \{\mathcal{F}_{\mathcal{D}}(x) - y_i\}} \left((d^2 u^2 + d^2 (B - u)^2) - (v^2 + (A - v)^2)\right)$

Proof: The expression on the right is a difference of two quadratic expressions and
can be expressed as $q_1(u) - q_2(v)$. For a particular u, the expression is maximized
when the quadratic $q_2(v) = (v^2 + (A - v)^2)$ is minimized. Observe that this quadratic
is globally minimized at $v = A/2$. We need to perform this minimization over the set
$v \in \mathcal{F}_{\mathcal{D}}(x) - y_i$ (this is the set of values which lie within the upper and lower boundaries
of the parallelogram shown in fig. 3-27). There are three cases to consider.

Case I: $u \in [A/2d,\ B - A/2d]$

First, notice that for u in this range, it is easy to verify that the upper boundary
of the parallelogram is greater than $A/2$ while the lower boundary is less than $A/2$.
Thus we can find a value of v (viz. $v = A/2$) which globally minimizes this quadratic
because $A/2 \in \mathcal{F}_{\mathcal{D}}(x) - y_i$. The expression thus reduces to $d^2 u^2 + d^2 (B - u)^2 - A^2/2$.
Over the interval for u considered in this case, it is minimized at $u = B/2$ resulting
in the value

$$(d^2 B^2 - A^2)/2$$

Case II: $u \in [0,\ A/2d]$

In this case, the upper boundary of the parallelogram (which is the maximum value
v can take) is less than $A/2$ and hence $q_2(v)$ is minimized when $v = du$. The total
expression then reduces to

$$d^2 u^2 + d^2 (B - u)^2 - \left((du)^2 + (A - du)^2\right) = d^2 (B - u)^2 - (A - du)^2 = (d^2 B^2 - A^2) - 2ud(dB - A)$$

Since $dB \ge A$, the above is minimized on this interval by choosing $u = A/2d$, resulting
in the value

$$dB(dB - A)$$

Case III: By symmetry, this reduces to case II.

Since $(d^2 B^2 - A^2)/2 \le dB(dB - A)$ (this is easily seen by completing squares), it
follows that $u = B/2$ is the global solution of the mini-max problem above. Further,
we have shown that for this value of u, the sup term reduces to $(d^2 B^2 - A^2)/2$ and
the lemma is proved. □
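
The lemma can also be checked by brute force for particular numbers. In the sketch below, the values of A, B and d, the grid resolution and the function name are illustrative assumptions; the feasible range for v is taken to be the vertical slice of the parallelogram at offset u, and the case $y_{i+1} \ge y_i$ is assumed.

```python
def sup_over_v(u, A, B, d):
    """sup over feasible v of (d^2 u^2 + d^2 (B - u)^2) - (v^2 + (A - v)^2)."""
    lo = max(-d * u, A - d * (B - u))   # lower edge of the parallelogram, relative to y_i
    hi = min(d * u, A + d * (B - u))    # upper edge of the parallelogram, relative to y_i
    v = min(max(A / 2.0, lo), hi)       # q2 is convex with minimum at A / 2, so clamp into [lo, hi]
    return d * d * (u * u + (B - u) ** 2) - (v * v + (A - v) ** 2)

A, B, d = 0.3, 1.0, 2.0                 # hypothetical interval with 0 <= A < d * B
N = 1000                                # even, so that u = B / 2 lies on the search grid
values = [(sup_over_v(B * k / N, A, B, d), B * k / N) for k in range(N + 1)]
best_value, best_u = min(values)
predicted = (d * d * B * B - A * A) / 2.0
print(best_u, best_value, predicted)
```

The grid search recovers the midpoint as the minimizer and the value $(d^2 B^2 - A^2)/2$ stated in the lemma.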

Using the above lemma along with eq. 3.32, we see that

$$\min_{x \in C_i} \sup_{y \in \mathcal{F}_{\mathcal{D}}(x)} e_{C_i}(\mathcal{H}, \mathcal{D} \cup (x, y), \mathcal{F}) = \frac{1}{8d}(d^2 B^2 - A^2) = \frac{1}{2} e_{C_i}(\mathcal{H}, \mathcal{D}, \mathcal{F})$$

In other words, by sampling the midpoint of the interval $C_i$, we are guaranteed to
reduce the uncertainty by 1/2. As in the case of monotonic functions, we now see,
using eq. 3.31, that we should sample the midpoint of the interval with largest
uncertainty $e_{C_i}(\mathcal{H}, \mathcal{D}, \mathcal{F})$ to obtain the global solution in accordance with the principle
of Algorithm B of section 3.1.

This allows us to formally state an active learning algorithm which is optimal in 
the sense implied in our formulation. 

The Choose and Learn Algorithm - 2 (CLA-2) 

1. [Initial Step] Ask for values of the function at points x = 0 and x = 1. At this
stage, the domain D = [0, 1] is composed of one interval only, viz., $C_1 = [0, 1]$.
Compute $e_{C_1} = \frac{1}{4d}\left(d^2 - |f(1) - f(0)|^2\right)$ and $e_D = e_{C_1}$. If $e_D \le \epsilon$, stop and
output the linear interpolant of the samples as the hypothesis, otherwise query
the midpoint of the interval to get a partition of the domain into two subintervals
[0, 1/2) and [1/2, 1].

2. [General Update and Stopping Rule] In general, at the kth stage, suppose
that our partition of the interval [0, 1] is $[x_0 = 0, x_1), [x_1, x_2), \dots, [x_{k-1}, x_k = 1]$.
We compute the uncertainty $e_{C_i} = \frac{1}{4d}\left(d^2 (x_i - x_{i-1})^2 - |y_i - y_{i-1}|^2\right)$
for each $i = 1, \dots, k$. The midpoint of the interval with maximum $e_{C_i}$ is queried
for the next sample. The total error $e_D = \sum_{i=1}^{k} e_{C_i}$ is computed at each stage
and the process is terminated when $e_D \le \epsilon$. Our hypothesis h at every stage is
a linear interpolation of all the points sampled so far and our final hypothesis
is obtained upon the termination of the whole process.
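
The two steps above can be sketched directly in code. This is a minimal illustration rather than the implementation used for the simulations: the function name `cla2`, the `max_queries` safeguard and the sine target are assumptions introduced here.

```python
import math

def cla2(f, d, eps, max_queries=10_000):
    """Choose-and-Learn for targets with |f'| <= d.

    Returns the sample points, their values, and the final uncertainty bound e_D.
    """
    xs, ys = [0.0, 1.0], [f(0.0), f(1.0)]        # initial step: query both endpoints
    while True:
        # interval uncertainty e_C = (d^2 B^2 - A^2) / (4 d)
        e = [(d * d * (xs[i + 1] - xs[i]) ** 2 - (ys[i + 1] - ys[i]) ** 2) / (4.0 * d)
             for i in range(len(xs) - 1)]
        e_total = sum(e)
        if e_total <= eps or len(xs) - 2 >= max_queries:
            return xs, ys, e_total               # stopping rule (or the safeguard) reached
        i = max(range(len(e)), key=e.__getitem__)
        mid = 0.5 * (xs[i] + xs[i + 1])          # query the midpoint of the most uncertain interval
        xs.insert(i + 1, mid)
        ys.insert(i + 1, f(mid))

d, eps = 4.0, 0.02
xs, ys, bound = cla2(lambda x: math.sin(4.0 * x), d, eps)   # hypothetical target with |f'| <= 4
print(len(xs), bound)
```

The hypothesis is the linear interpolant through `(xs, ys)`, and the returned bound certifies that its $L_1$ error is at most `eps` for any target consistent with the data.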

It is possible to show that the following upper bound exists on the number of
examples CLA-2 would take to learn the class of functions in consideration

Theorem 3.3.3 The CLA-2 would PAC learn the class in at most $\frac{d}{4\epsilon} + 1$ examples.

Proof Sketch: Following a strategy similar to the proof of Theorem 3, we show how
a slight variant of CLA-2 would converge in at most $(d/4\epsilon + 1)$ examples. Imagine a
grid of n points placed $1/(n - 1)$ apart on the domain D = [0, 1] where the kth point
is $k/(n - 1)$ (for k going from 0 to $n - 1$). The variant of the CLA-2 operates by
confining its queries to points on this grid. Thus at the kth stage, instead of querying
the midpoint of the interval with maximum uncertainty, it will query the gridpoint
closest to this midpoint. Suppose it uses up all the gridpoints in this fashion, then
there will be $n - 1$ intervals and by our arguments above, we have seen that the
maximum error on each interval is bounded by

$$\frac{1}{4d}\left(d^2 \left(\frac{1}{n-1}\right)^2 - |y_i - y_{i-1}|^2\right) \le \frac{1}{4d}\, d^2 \left(\frac{1}{n-1}\right)^2$$

Since there are $n - 1$ such intervals, the total error it could make is bounded by

$$(n - 1)\, \frac{1}{4d}\, d^2 \left(\frac{1}{n-1}\right)^2 = \frac{1}{4d}\, d^2 \left(\frac{1}{n-1}\right) = \frac{d}{4(n-1)}$$

It is easy to show that for $n > d/4\epsilon + 1$, this maximum error is less than $\epsilon$. Thus
the learner need not collect any more than $d/4\epsilon + 1$ examples to learn the target
function to within an $\epsilon$ accuracy. Note that the learner will have identified the target
to $\epsilon$ accuracy with probability 1 (always) by following the strategy outlined in this
variant of CLA-2. □

We now have both an upper and lower bound for PAC-learning the class (under
a uniform distribution) with queries. Notice that here as well, the sample complexity
of active learning does not depend upon the confidence parameter $\delta$. Thus for $\delta$
arbitrarily small, the difference in sample complexities between passive and active
learning becomes arbitrarily large, with active learning requiring far fewer examples.

3.3.3 Some Simulations

We now provide some simulations conducted on arbitrary functions of the class of
functions with bounded derivative (the class $\mathcal{F}$). Fig. 3-28 shows four arbitrarily selected
functions which were chosen to be the target function for the approximation scheme
considered. In particular, we are interested in observing how the active strategy
samples the target function for each case. Further, we are interested in comparing
the active and passive techniques with respect to error rates for the same number of
examples drawn. In this case, we have been unable to derive an analytical solution to
the classical optimal recovery problem. Hence, we do not compare it as an alternative
sampling strategy in our simulations.

Distribution of points selected 

The active algorithm CLA-2 selects points adaptively on the basis of previous
examples received. Thus the distribution of the sample points in the domain D of the
function depends inherently upon the arbitrary target function. Consider for example,
the distribution of points when the target function is chosen to be Function-1 of
the set shown in fig. 3-28.

Notice (as shown in fig. 3-29) that the algorithm chooses to sample densely in
places where the target is flat, and less densely where the function has a steep slope.
As our mathematical analysis of the earlier section showed, this is well founded.
Roughly speaking, if the function has the same value at $x_i$ and $x_{i+1}$, then it could
have a variety of values (wiggle a lot) within. However, if $f(x_{i+1})$ is much greater (or
less) than $f(x_i)$, then, in view of the bound, d, on how fast it can change, it would
have had to increase (or decrease) steadily over the interval. In the second case, the
rate of change of the function over the interval is high, there is less uncertainty in the
values of the function within the interval, and consequently fewer samples are needed
in between.

In example 1, for the case of monotone functions, we saw that the density of
sample points was proportional to the first derivative of the target function. By
contrast, in this example, the optimal strategy chooses to sample points with a density
which is inversely proportional to the magnitude of the first derivative of the target
function. Fig. 3-30 exemplifies this.
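
This preference for flat regions shows up even in a very small experiment. The sketch below runs a fixed number of CLA-2 midpoint queries on a piecewise-linear target that climbs at the maximal slope d on the left half and is flat on the right half. The target is an idealized limiting case chosen for illustration (it has a kink at 0.5, so it is only a stand-in for a differentiable function), and the function names are assumptions.

```python
def cla2_points(f, d, n_queries):
    """Run n_queries steps of the CLA-2 midpoint rule and return the sorted sample points."""
    xs, ys = [0.0, 1.0], [f(0.0), f(1.0)]
    for _ in range(n_queries):
        def score(i):
            # interval uncertainty (d^2 B^2 - A^2) / (4 d); the factor 1 / (4 d) does not change the argmax
            B = xs[i + 1] - xs[i]
            A = ys[i + 1] - ys[i]
            return d * d * B * B - A * A
        i = max(range(len(xs) - 1), key=score)
        mid = 0.5 * (xs[i] + xs[i + 1])
        xs.insert(i + 1, mid)
        ys.insert(i + 1, f(mid))
    return xs

d = 1.0

def ramp_then_flat(x):
    # hypothetical target: slope d on [0, 0.5], constant on [0.5, 1]
    return d * min(x, 0.5)

xs = cla2_points(ramp_then_flat, d, 8)
print(xs)
```

Once the point 0.5 has been queried, the steep half carries zero uncertainty (its chord already has slope d), so every later query falls in the flat half.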

Error Rates: 

In an attempt to relate the number of examples drawn and the error made by the
learner, we performed the following simulation.

Simulation B:

1. Pick an arbitrary function from class $\mathcal{F}$.

2. Decide N, the number of samples to be collected. There are two methods of
collection of samples. The first (passive) is by randomly drawing N examples
according to a uniform distribution on [0, 1]. The second (active) is the CLA-2.

3. The two learning algorithms differ only in their method of obtaining samples.
Once the samples are obtained, both algorithms attempt to approximate the
target by the linear interpolant of the samples (first order splines).

4. This entire process is now repeated for various values of N for the same target
function and then repeated again for the four different target functions of fig. 3-28.

The results are shown in fig. 3-31. Notice how the active learner outperforms the 
passive learner. For the same number of examples, the active scheme having chosen 
its examples optimally by our algorithm makes less error. 

We have obtained, in theorem 6, an upper bound on the performance of the active
learner. However, as we remarked earlier, the number of examples the
active algorithm takes before stopping (i.e., outputting an $\epsilon$-good approximation)
varies and depends upon the nature of the target function. "Simple" functions are
learned quickly; "difficult" functions are learned slowly. As a point of interest, we
have shown in fig. 3-32 how the actual number of examples drawn varies with $\epsilon$. In
order to learn a target function to $\epsilon$-accuracy, CLA-2 needs at most $n_{max}(\epsilon) = d/(4\epsilon) + 1$
examples. However, for a particular target function, $f$, let the number of examples it
actually requires be $n_f(\epsilon)$. We plot the ratio $n_f(\epsilon)/n_{max}(\epsilon)$ as a function of $\epsilon$.
Notice, first, that this ratio is always much less than 1. In other words, the active learner stops before the
worst case upper bound with a guaranteed $\epsilon$-good hypothesis. This is the significant
advantage of an adaptive sampling scheme. Recall that for uniform sampling (or
even classical optimal recovery) we would have no choice but to ask for $d/(4\epsilon)$ examples
to be sure of having an $\epsilon$-good hypothesis. Further, notice that as $\epsilon$ gets smaller,
the ratio gets smaller. This suggests that for these functions, the sample complexity
of the active learner is of a different (smaller) order than the worst case bound. Of
course, there always exists some function in $\mathcal{F}$ which would force the active learner
to perform at its worst case sample complexity level.

3.4 Conclusions, Extensions, and Open Problems

This part of the chapter focused on the possibility of devising active strategies to 
collect data for the problem of approximating real-valued function classes. We were 
able to derive a sequential version of optimal recovery. This sequential version, by 
virtue of using partial information about the target function is superior to classical 
optimal recovery. This provided us with a general formulation of an adaptive sam- 
pling strategy, which we then demonstrated on two example cases. Theoretical and 
empirical bounds on the sample complexity of passive and active learning for these 
cases suggest the superiority of the active scheme as far as the number of examples 
needed is concerned. It is worthwhile to observe that the same general framework 
gave rise to completely different sampling schemes in the two examples we consid- 
ered. In one, the learner sampled densely in regions of high change. In the other, 
the learner did the precise reverse. This should lead us to further appreciate the fact 
that active sampling strategies are very task-dependent. 

Using the same general formulation, we were also able to devise active strategies
(again with a gain in sample complexity) for the following concept classes. 1)
For the class of indicator functions $\{1_{[a,b]} : 0 < a < b < 1\}$ on the interval $[0,1]$,
the sample complexity is reduced from $(1/\epsilon)\ln(1/\delta)$ for passive learning to $\ln(1/\epsilon)$ by
adding membership queries. 2) For the class of half-spaces on a regular $n$-simplex, the
sample complexity is reduced from $(n/\epsilon)\ln(1/\delta)$ to $n^2 \ln(s/\epsilon)$ by adding membership
queries. Note that similar gains have been obtained for this class by Eisenberg (1992)
using a different framework.
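
To make the logarithmic query count concrete, the following is a minimal sketch for the simpler one-boundary class $1_{[a,1]}$ (not the two-boundary interval class itself, which would additionally need a positive example to start from). The function name `locate_threshold` and the `oracle` callback are hypothetical; the sketch only assumes that a membership query returns the label of any chosen point.

```python
import math

def locate_threshold(oracle, eps):
    """Bisection with membership queries for a target 1_[a,1] on [0, 1].

    oracle(x) returns True when x >= a (hypothetical membership oracle).
    Returns (estimate of a, number of queries used).
    """
    lo, hi = 0.0, 1.0            # invariant: the boundary a lies in [lo, hi]
    queries = 0
    while hi - lo > eps:
        mid = 0.5 * (lo + hi)
        queries += 1
        if oracle(mid):          # label 1: the boundary is at or left of mid
            hi = mid
        else:                    # label 0: the boundary is right of mid
            lo = mid
    return 0.5 * (lo + hi), queries

a_true = 0.3141
estimate, used = locate_threshold(lambda x: x >= a_true, eps=1e-3)
print(estimate, used, math.ceil(math.log2(1 / 1e-3)))
```

Each query halves the interval known to contain the boundary, so the number of queries grows like $\log_2(1/\epsilon)$ rather than $1/\epsilon$.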

There are several directions for further research. First, one could consider the 
possibility of adding noise to our formulation of the problem. Noisy versions of 
optimal recovery exist and this might not be conceptually a very difficult problem. 
Although the general formulation (at least in the noise-free case) is complete, it might 
not be possible to compute the uncertainty bounds $e_C$ for a variety of function classes.
Without this, one could not actually use this paradigm to obtain a specific algorithm. 
A natural direction to pursue would be to investigate other classes (especially in more 
dimensions than 1) and other distance metrics to obtain further specific results. We 
observed that the active learning algorithm lay between classical optimal recovery and 
the optimal teacher. It would be interesting to compare the exact differences in a more 
principled way. In particular, an interesting open question is whether the sampling 
strategy of the active learner converges to that of the optimal teacher as more and 
more information becomes available. It would not be unreasonable to expect this, 
though precise results are lacking. In general, on the theme of better characterizing 
the conditions under which active learning would vastly outperform passive learning 
for function approximation, much work remains to be done. While active learning
might require fewer examples to learn the target function, its computational burden 
is significantly larger. It is necessary to explore the information/computation trade- 
off with active learning schemes. Finally, we should note, that we have adopted in 
this part, a model of learning motivated by PAC but with a crucial difference. The 
distance metric, d, is not necessarily related to the distribution according to which 
data is drawn (in the passive case). This prevents us from using traditional uniform 
convergence (Vapnik, 1982) type arguments to prove learnability. The problem of 
learning under a different metric is an interesting one and merits further investigation 
in its own right. 

Part II: Epsilon Focusing: A Strategy for Active 
Learning 

In Part I, we discussed a principled strategy by means of which an active learner 
could choose its own examples, thereby potentially reducing the informational
complexity of learning real-valued functions. The formalization adopted ideas from
optimal recovery, and active learning reduced to a sequential version of the optimal
recovery problem. In this part of the chapter, we discuss another possible scheme for 
choosing examples. 

Recall that according to the PAC criterion for learning, we need to learn the
target function to $\epsilon$ accuracy (according to some distance metric $d$ on the space of
functions, $\mathcal{F}$) with confidence greater than $1 - \delta$. Sometimes, knowledge that the
function lies within some $\epsilon$-ball (in function space) might directly translate (due to
locality properties) into knowledge about the regions of the domain $X$ over which the
target function values are uncertain. The learner can then zoom in (epsilon-focus) on
this region of uncertainty, and sample there. As a motivating real-world example, one
could imagine that in a pattern classification task, the knowledge that the learner is
within $\epsilon$ of the optimal discriminant boundary might inform the learner about which
regions of the feature space are worth sampling to a greater degree. Intuitively, one
might think that regions close to the decision boundary are such worthwhile regions.

We formally illustrate this idea with a simple example in the next section. In all
the cases we consider, the concept classes (classes of indicator functions) have bounded
VC dimension. Consequently, they are learnable, and upper and lower bounds on
the sample complexity of passive learning exist for these function classes. Roughly
speaking, instead of learning to $(\epsilon, \delta)$ accuracy in one shot by collecting the requisite
number of examples, the learner first attempts to obtain a loose estimate of the target.
Making use of locality properties, the learner then obtains a loose estimate of the
regions of the domain to sample more closely. On the basis of these fresh samples, the
learner tightens its estimate of the target, thereby reducing the region of uncertainty.
It then freshly samples this new, reduced region of uncertainty and carries on in this
fashion. The learner can arbitrarily reduce the sample complexity of learning by this
scheme.

After our motivating example, we provide some generalizations, and finally end 
with some open questions. 

3.5 A Simple Example 

Suppose we want to PAC-learn (with $(\epsilon, \delta)$ accuracy) the following class of indicator
functions from $[0, 1]$ to $\{0, 1\}$.

$$\mathcal{F} = \{1_{[a,1]} : 0 < a \le 1\}$$

Further suppose the distribution $P$ on $[0, 1]$ according to which data is drawn is known
and is uniform. It is known that a passive learner would need at least $\Omega((1/\epsilon)\ln(1/\delta))$
examples to do so. We suggest the following $k$-step strategy, which seeks examples
from successively smaller, well-focused regions of the domain to learn this class in
$O((k/\epsilon^{2/k})\ln(k/\delta))$ examples.

The $\epsilon$-focusing Algorithm (1)

The learning occurs over $k$ stages ($k$ can be chosen arbitrarily).

1. Draw enough examples to learn the target to $\epsilon^{1/k}$ accuracy with confidence
$1 - \delta/k$. Obtain hypothesis $1_{[\hat{a}_1,1]}$.

2. Now ask for examples drawn uniformly at random from the region
$[\hat{a}_1 - \epsilon^{1/k}, \hat{a}_1 + \epsilon^{1/k}]$ and try to learn the target function to $\epsilon^{1/k}/2$ accuracy with confidence
$1 - \delta/k$ (with respect to this new distribution over the smaller region). Obtain
hypothesis $1_{[\hat{a}_2,1]}$.

3. Repeat as in step 2, i.e., ask for enough examples drawn uniformly at random
from the region $[\hat{a}_2 - \epsilon^{2/k}, \hat{a}_2 + \epsilon^{2/k}]$ in order to learn the target function to
$\epsilon^{1/k}/2$ accuracy with confidence $1 - \delta/k$. Obtain hypothesis $1_{[\hat{a}_3,1]}$. In general, at
the $j$th step, ask for examples drawn uniformly at random from the region
$[\hat{a}_{j-1} - \epsilon^{(j-1)/k}, \hat{a}_{j-1} + \epsilon^{(j-1)/k}]$ to learn the target to within $\epsilon^{1/k}/2$ accuracy with
confidence $1 - \delta/k$. Obtain hypothesis $1_{[\hat{a}_j,1]}$.

4. Stop with hypothesis $1_{[\hat{a}_k,1]}$.

Proof of Correctness: Let the target be $1_{[a_t,1]}$. At the end of the first step, the target
is within $\epsilon^{1/k}$ of the hypothesis with probability greater than $1 - \delta/k$. This means
that with high probability $|a_t - \hat{a}_1| < \epsilon^{1/k}$, or in other words $\hat{a}_1 - \epsilon^{1/k} < a_t < \hat{a}_1 + \epsilon^{1/k}$.
We now draw examples only from the region $[\hat{a}_1 - \epsilon^{1/k}, \hat{a}_1 + \epsilon^{1/k}]$. Let this distribution
be $P_2$. By a theorem of Vapnik and Chervonenkis, we need to draw $(4/\epsilon^{2/k})\ln(k/\delta)$
examples to learn the target to within $\epsilon^{1/k}/2$ with confidence $1 - \delta/k$ (for an arbitrary
distribution) at this stage. This means that

$$d_{P_2}(1_{[a_t,1]}, 1_{[\hat{a}_2,1]}) = \frac{1}{2\epsilon^{1/k}} |a_t - \hat{a}_2| < \frac{\epsilon^{1/k}}{2}$$

In other words

$$|a_t - \hat{a}_2| < \epsilon^{2/k}$$

Thus after two steps, the above inequality is true. We now draw examples only
from the region $[\hat{a}_2 - \epsilon^{2/k}, \hat{a}_2 + \epsilon^{2/k}]$.

In general, at the $j$th step, if we draw $(4/\epsilon^{2/k})\ln(k/\delta)$ examples, we would have
learnt the target to $\epsilon^{1/k}/2$ accuracy with confidence $1 - \delta/k$. The distribution ($P_j$) according
to which examples are drawn at this stage is uniform over
$[\hat{a}_{j-1} - \epsilon^{(j-1)/k}, \hat{a}_{j-1} + \epsilon^{(j-1)/k}]$. Thus,

$$d_{P_j}(1_{[a_t,1]}, 1_{[\hat{a}_j,1]}) = \frac{1}{2\epsilon^{(j-1)/k}} |a_t - \hat{a}_j| < \frac{\epsilon^{1/k}}{2}.$$

So we have,

$$|a_t - \hat{a}_j| < \epsilon^{j/k}.$$

This happens with probability greater than $1 - \delta/k$. Thus with high probability, from
the $(j-1)$th stage to the $j$th stage, we have "focused" more closely onto $a_t$. If this
is true at every stage, we would eventually have after $k$ steps ensured that

$$|\hat{a}_k - a_t| < \epsilon$$

which would mean that we have learnt the target to within an $\epsilon$ width.

If we fail at any stage, the eventual hypothesis $\hat{a}_k$ is not necessarily within an $\epsilon$
width of the target. The probability of failing at each stage is less than $\delta/k$, so the
probability of failing in at least one stage is less than $k \cdot \delta/k = \delta$. Thus the probability
of failing is less than $\delta$, or in other words, with probability greater than $1 - \delta$, we would
have learnt the target to within an $\epsilon$ width, which was our goal.

The total number of examples drawn at each stage is $(4/\epsilon^{2/k})\ln(k/\delta)$, and since
there are $k$ stages in all, the total number of examples in the whole process is
$(4k/\epsilon^{2/k})\ln(k/\delta)$.
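
The staged procedure can be sketched as a small simulation. This is an illustrative sketch under stated assumptions, not the author's implementation: the names `stage_sample_size` and `epsilon_focus` are hypothetical, each stage's hypothesis is taken to be the midpoint between the largest negative and smallest positive sample (one of many consistent choices), and focusing regions are clipped to $[0, 1]$. The per-stage sample size uses the count $(4/\epsilon^{2/k})\ln(k/\delta)$ from the proof.

```python
import math
import random

def stage_sample_size(eps, k, delta):
    """Per-stage sample size (4 / eps^(2/k)) * ln(k / delta) from the proof."""
    return math.ceil(4.0 / eps ** (2.0 / k) * math.log(k / delta))

def epsilon_focus(a_t, eps, k, delta, seed=0):
    """Simulate k focusing stages for a target 1_[a_t,1] under uniform sampling.

    Returns (final threshold estimate, total number of examples drawn).
    """
    rng = random.Random(seed)
    m = stage_sample_size(eps, k, delta)
    lo, hi = 0.0, 1.0                      # stage 1 samples the whole domain
    a_hat = 0.5
    for j in range(1, k + 1):
        neg_max, pos_min = lo, hi          # tightest bracket seen in this stage
        for _ in range(m):
            x = rng.uniform(lo, hi)
            if x >= a_t:                   # oracle label is 1
                pos_min = min(pos_min, x)
            else:                          # oracle label is 0
                neg_max = max(neg_max, x)
        a_hat = 0.5 * (neg_max + pos_min)  # a hypothesis consistent with the sample
        half_width = eps ** (j / k)        # region for stage j + 1
        lo = max(0.0, a_hat - half_width)  # assumption: clip the region to [0, 1]
        hi = min(1.0, a_hat + half_width)
    return a_hat, k * m

eps, k, delta = 1e-4, 4, 0.05
a_hat, n_active = epsilon_focus(0.3, eps, k, delta)
n_passive = math.ceil((1.0 / eps) * math.log(1.0 / delta))
print(abs(a_hat - 0.3) < eps, n_active, n_passive)
```

With these illustrative settings the staged learner draws fewer examples than the passive count $(1/\epsilon)\ln(1/\delta)$; for larger $\epsilon$ or smaller $k$ the constants can reverse the comparison, so the gain should be read as an asymptotic one.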

3.6 Generalizations

This general strategy can be extended to several other scenarios. We introduce the
notion of localized function classes. These classes, which have a local focusing property,
can be learned faster by the method of $\epsilon$-focusing. We mention some concrete results
obtained by using this scheme for n-dimensional cases, and for the case of noisy
examples. No proofs or formal arguments are provided for these extensions. We
hope, though, that the reader will appreciate the spirit of this idea.

3.6.1 Localized Function Classes 

The previous sections showed how to use the $\epsilon$-focusing strategy to obtain superior
sample complexity results for some simple concept classes. It is of interest to characterize
general conditions on function classes for which the $\epsilon$-focusing strategy would
yield such superior performance. It is noteworthy that the previous function class
had the property that knowledge of the distance between any two functions $f$ and $g$
in $\mathcal{F}$ (in the $d_P$ metric) allowed us to focus in on a region of interest in the domain
$X = [0, 1]$ where $f$ and $g$ differ. We formalize this notion to derive a general bound
on sample complexity for the $\epsilon$-focusing strategy.

Let $\mathcal{F}$ be a concept class (i.e., class of indicator functions) on some compact
domain $X$. Let $P$ be the uniform distribution on this domain, i.e., the distribution
which corresponds to the normalized Lebesgue measure on it. We define the usual
$L_1(\mu)$ distance metric on the space of functions by

$$d_\mu(f, g) = \int_X |f - g| \, d\mu$$

(where $\mu$ is a probability measure on the set $X$.)
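
As a quick illustration (a worked instance added here, not part of the original derivation), take two threshold functions of the kind used in the simple example of Section 3.5, with the measure uniform on $[0, 1]$ and $a \le b$. The two functions differ exactly on $[a, b)$, so the distance is the length of that interval:

```latex
% Assumption: \mu is the uniform measure on [0,1] and a \le b.
d_\mu\left(1_{[a,1]}, 1_{[b,1]}\right)
  = \int_0^1 \left| 1_{[a,1]}(x) - 1_{[b,1]}(x) \right| dx
  = \int_a^b 1 \, dx
  = b - a .
```

For this class, a small distance in function space therefore confines the disagreement to a short interval of the domain, which is the locality property the focusing strategy relies on.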

We define the local focusing property of such an arbitrarily defined concept class
as follows:

Definition 3.6.1 For a given $f$ belonging to some concept class $\mathcal{F}$ on $X$, and for
any given $\epsilon > 0$, its $\epsilon$-region of interest, $\mathcal{R}_\epsilon(f)$, is given by

$$\mathcal{R}_\epsilon(f) = \{x \in X \mid f(x) \neq g(x) \text{ for some } g \in \mathcal{F} \text{ such that } d_P(f, g) < \epsilon\}$$

Definition 3.6.2 The concept class $\mathcal{F}$ is said to be locally focused with focusing bound
$g$ ($g$ is a real-valued function taking values in $[0, 1]$) if for every $\epsilon > 0$,

$$\sup_{f \in \mathcal{F}} \mathrm{Volume}(\mathcal{R}_\epsilon(f)) \le g(\epsilon)$$

Here, Volume(s) for any set $s \subseteq X$ is simply the volume of that set (see footnote 17). We assume
that Volume(X) = 1.

Clearly, locally focused classes are those with bounded $\epsilon$-regions of interest into
which we can focus in the iterative manner of Algorithm 1.

3.6.2 The General $\epsilon$-focusing Strategy

The general algorithm to learn such locally focused classes is as follows:

Algorithm 2

1. Begin with the entire class $\mathcal{F}$, draw examples according to the uniform distribution
$P$ on $X$ (call this $P_1$), and attempt to learn the target ($f_t \in \mathcal{F}$) to $\epsilon^{1/k}$ accuracy
with probability at least $1 - \delta/k$. Obtain hypothesis $f_1$. Also obtain the reduced
set of candidate target functions (version space),

$$\mathcal{F}_1 = \{f \in \mathcal{F} \mid d_{P_1}(f, f_1) < \epsilon^{1/k}\}$$

Finally, also obtain the $\epsilon$-region of interest:

$$R_1 = \mathcal{R}_{\epsilon^{1/k}}(f_1).$$

2. Draw examples according to a uniform distribution on $R_1$ (call this distribution
$P_2$) and learn the target to $\epsilon^{2/k}/g(\epsilon^{1/k})$ accuracy (according to $P_2$) with probability greater
than $1 - \delta/k$. Now obtain hypothesis $f_2 \in \mathcal{F}_1$, the reduced version space:

$$\mathcal{F}_2 = \left\{f \in \mathcal{F}_1 \;\middle|\; d_{P_2}(f, f_2) < \frac{\epsilon^{2/k}}{g(\epsilon^{1/k})}\right\}$$

and $R_2 = \mathcal{R}_{\epsilon^{2/k}}(f_2)$.

3. Repeat step 2. In general, at the $j$th step, learn the target to $\epsilon^{j/k}/g(\epsilon^{(j-1)/k})$ accuracy (according
to distribution $P_j$), and obtain $f_j$, $\mathcal{F}_j$ and $R_j$ in the obvious way.



17 From a more formal perspective, one should really replace Volume(s) by the measure of the set
s, i.e., P(s). Clearly, P(X) = 1. In our case, we assume that Volume(X) = 1. Since P is a uniform 
distribution, i.e., any point in this set is as likely as any other point, it follows that P(s) is simply 
Volume(s). We will continue to use this notation, but the reader will easily see that P can be used 
in general, and in fact, need not even be uniform. 

4. Stop at the $k$th step and output hypothesis $f_k$.
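
The accuracy targets that the general strategy sets for each stage can be tabulated with a few lines of code. This is a sketch under assumptions: the function names are hypothetical, the focusing bound `g` is supplied by the caller, and the per-stage sample size uses a constant of 1 in front of $\ln(k/\delta)/\text{accuracy}^2$, which is an illustrative choice rather than a constant taken from the text.

```python
import math

def focusing_schedule(eps, k, g):
    """Target accuracy for each of the k stages of the general strategy."""
    schedule = []
    for j in range(1, k + 1):
        if j == 1:
            acc = eps ** (1.0 / k)                       # stage 1: eps^(1/k) under P_1
        else:
            acc = eps ** (j / k) / g(eps ** ((j - 1) / k))  # eps^(j/k) / g(eps^((j-1)/k))
        schedule.append(acc)
    return schedule

def stage_sample_sizes(eps, k, delta, g):
    """Illustrative per-stage sizes ln(k/delta) / accuracy^2 (constant assumed to be 1)."""
    return [math.ceil(math.log(k / delta) / acc ** 2)
            for acc in focusing_schedule(eps, k, g)]

# Example with the linear focusing bound g(e) = 2e of the threshold class.
print(focusing_schedule(1e-4, 4, lambda e: 2 * e))
print(stage_sample_sizes(1e-4, 4, 0.05, lambda e: 2 * e))
```

For a linear focusing bound, every stage after the first asks for the same relative accuracy, which is why the per-stage cost does not grow as the learner zooms in.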

Proof of Learnability:

Recall that our eventual goal is to learn the unknown target $f_t$ within $\epsilon$ accuracy
(according to the distance metric $d_P$) with probability greater than $1 - \delta$.

Consider the first step. The target has been learned to $\epsilon^{1/k}$ accuracy with high
confidence. The learner's hypothesis is $f_1$. Clearly, with high probability (greater
than $1 - \delta/k$), the target lies within an $\epsilon^{1/k}$ ball around $f_1$ (this ball is denoted by $\mathcal{F}_1$).
According to our definition, all functions in $\mathcal{F}_1$ agree on the region outside of $R_1$. So
we only need to sample the region $R_1$, which is what we do in the second step.

In the second step, we learn the target to $\epsilon^{2/k}/g(\epsilon^{1/k})$ accuracy. This is according to a
distribution $P_2$ (uniform on the region $R_1$). Again, the target is within an $\epsilon^{2/k}/g(\epsilon^{1/k})$
ball of the hypothesis at this stage ($f_2$). Thus,

$$d_{P_2}(f_2, f_t) = \frac{\mathrm{Volume}(\{x \in R_1 \mid f_2(x) \neq f_t(x)\})}{\mathrm{Volume}(R_1)} < \frac{\epsilon^{2/k}}{g(\epsilon^{1/k})}$$

But $\mathrm{Volume}(R_1) \le g(\epsilon^{1/k})$. Therefore,

$$\mathrm{Volume}(\{x \in R_1 \mid f_2(x) \neq f_t(x)\}) < \epsilon^{2/k}$$

Clearly, then,

$$d_P(f_2, f_t) = \mathrm{Volume}(X \setminus R_1) \cdot 0 + \mathrm{Volume}(\{x \in R_1 \mid f_2(x) \neq f_t(x)\}) < \epsilon^{2/k}$$

Thus, after the second step, we see that the target $f_t$ is within $\epsilon^{2/k}$ accuracy
(with respect to our original distribution $P$). By our definition of the local focusing
property, we know that $f_t \in \mathcal{F}_2$, and the points on which $f_t$ and $f_2$ disagree must lie
within $R_2$.

In general, before the $j$th step, the points on which the target and the $(j-1)$th
hypothesis disagree must lie within $R_{j-1}$. Since we sample according to a uniform
distribution on this region ($P_j$), and attempt to learn the target to an accuracy of
$\epsilon^{j/k}/g(\epsilon^{(j-1)/k})$, by a similar argument,

$$d_{P_j}(f_j, f_t) = \frac{\mathrm{Volume}(\{x \in R_{j-1} \mid f_j(x) \neq f_t(x)\})}{\mathrm{Volume}(R_{j-1})} < \frac{\epsilon^{j/k}}{g(\epsilon^{(j-1)/k})}$$

But $\mathrm{Volume}(R_{j-1}) \le g(\epsilon^{(j-1)/k})$. Therefore,

$$\mathrm{Volume}(\{x \in R_{j-1} \mid f_j(x) \neq f_t(x)\}) < \epsilon^{j/k}$$

Clearly, then,

$$d_P(f_j, f_t) = \mathrm{Volume}(X \setminus R_{j-1}) \cdot 0 + \mathrm{Volume}(\{x \in R_{j-1} \mid f_j(x) \neq f_t(x)\}) < \epsilon^{j/k}$$

Thus, after the $j$th step, the learner has learned the target to $\epsilon^{j/k}$ accuracy. Further,
according to our definition of the local focusing property, the points on which
the learner and target disagree must lie within the set $R_j = \mathcal{R}_{\epsilon^{j/k}}(f_j)$.

Clearly, after the $k$th step, the learner will have learned the target to $\epsilon$ accuracy.
The only way in which the learner could have made a mistake is if it made a mistake
on any one of the steps. The probability of making a mistake in each step is at most $\delta/k$. The
probability of making a mistake in any one of the $k$ steps is therefore bounded by $\delta$. Thus, the learner would
have identified the target to $\epsilon$ accuracy with confidence greater than $1 - \delta$.
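Sample Complexity: By the standard Vapnik-Chervonenkis theorem, we see that
at the $j$th stage, the learner will have to draw at most

$$O\left(\frac{g^2(\epsilon^{(j-1)/k})}{\epsilon^{2j/k}} \ln(k/\delta)\right)$$

examples to satisfy the learnability requirement of that stage. The total number of examples
the learner needs would be

$$O\left(\sum_{j=1}^{k} \frac{g^2(\epsilon^{(j-1)/k})}{\epsilon^{2j/k}} \ln(k/\delta)\right)$$

(taking $g(\epsilon^0) = 1$ for the first stage, since Volume(X) = 1).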
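
As a sanity check on the stage-wise bound, here is a worked evaluation under two assumptions that are ours rather than the text's: the per-stage bound is read as $O((g^2(\epsilon^{(j-1)/k})/\epsilon^{2j/k})\ln(k/\delta))$, and the focusing bound is linear, $g(\epsilon) = c\,\epsilon$ for a hypothetical constant $c$.

```latex
% Assumption: g(\epsilon) = c\,\epsilon, with c a hypothetical constant.
\frac{g^2(\epsilon^{(j-1)/k})}{\epsilon^{2j/k}}
  = \frac{c^2\,\epsilon^{2(j-1)/k}}{\epsilon^{2j/k}}
  = \frac{c^2}{\epsilon^{2/k}}
\qquad\Longrightarrow\qquad
\sum_{j=1}^{k} \frac{c^2}{\epsilon^{2/k}} \ln(k/\delta)
  = \frac{c^2 k}{\epsilon^{2/k}} \ln(k/\delta).
```

Every stage then costs the same up to a constant (the first stage differs only in the constant factor), and the total is of order $(k/\epsilon^{2/k})\ln(k/\delta)$, which matches the order quoted for the one-dimensional example of Section 3.5.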

3.6.3 Generalizations and Open Problems 

Now we are in a position to re-evaluate our simple example from this general per- 
spective. It is easy to see that 

1. Opening Example: For an arbitrary $f_a = 1_{[a,1]}$, we see that

$$\mathcal{R}_\epsilon(f_a) = [a - \epsilon, a + \epsilon]$$

Clearly, $g(\epsilon) = 2\epsilon$. The sample complexity is $O((k/\epsilon^{2/k})\ln(k/\delta))$.

2. Box Functions: Consider the following class of indicator functions on $[0, 1]$.

$$\mathcal{F} = \{1_{[a,b]} : 0 < a < b < 1\}$$

For an arbitrary $f_{a,b} = 1_{[a,b]}$, we see that

$$\mathcal{R}_\epsilon(f_{a,b}) = [a - \epsilon, a + \epsilon] \cup [b - \epsilon, b + \epsilon]$$

Clearly, $g(\epsilon) = 4\epsilon$. The sample complexity $O((k/\epsilon^{2/k})\ln(k/\delta))$ follows.
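
The two regions of interest stated above can be checked numerically. The sketch below simply encodes the regions as given in the text (it does not derive them), clips them to the domain $[0, 1]$, and measures their volume; all function names are hypothetical.

```python
def merge(intervals):
    """Merge overlapping closed intervals and return them sorted."""
    merged = []
    for lo, hi in sorted(intervals):
        if merged and lo <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged

def clip(interval):
    """Restrict an interval to the domain [0, 1]."""
    lo, hi = interval
    return (max(0.0, lo), min(1.0, hi))

def region_threshold(a, eps):
    """Region of interest stated for f_a = 1_[a,1]: [a - eps, a + eps]."""
    return merge([clip((a - eps, a + eps))])

def region_box(a, b, eps):
    """Region of interest stated for f_{a,b} = 1_[a,b]: two eps-bands at a and b."""
    return merge([clip((a - eps, a + eps)), clip((b - eps, b + eps))])

def volume(intervals):
    """Total length of a list of disjoint intervals."""
    return sum(hi - lo for lo, hi in intervals)

print(volume(region_threshold(0.5, 0.1)))   # bounded by g(eps) = 2 * eps
print(volume(region_box(0.3, 0.7, 0.05)))   # bounded by g(eps) = 4 * eps
```

When the two bands of a box function overlap, or a band is cut off by the boundary of the domain, the measured volume falls below the bound, which is consistent with $g$ being an upper bound rather than an exact value.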

Some other generalizations should be noted. We do not attempt to provide any
formal arguments.

1. Extensions to n-dimensions: It is possible to extend the $\epsilon$-focusing strategy of
our opening example to an n-dimensional situation. A concrete example includes the
PAC learning of a concept class of hyperplanes dividing an n-simplex into two regions. 
Essentially, the hyperplane cuts the simplex at its edges. Consequently, along each 
edge, the points on one side of the cut are labelled 0, while the points on the other 
side are labelled 1. Thus, if one confines oneself to finding the intersection of the 
hyperplane with the simplex edge, the problem reduces to a single dimensional case 
exactly like our opening example. If n such edge-intersection problems are solved, 
then the total n-dimensional problem can be solved. 

Since we have an effective $\epsilon$-focusing strategy for box functions,
we can even address concept classes represented by multilayer perceptrons with two 
hidden layers. In such a case, there are at most two hyperplanes intersecting each 
edge. The single-dimensional problem associated with each edge is like a box function. 

2. Handling misclassification noise: The $\epsilon$-focusing strategy in this part has been
developed for a noise-free case. Extensions to cover a situation with a bound on the
misclassification noise (the label of the example can be flipped with probability at
most $\eta$) can easily be considered as well.

Finally, some natural questions arise at this stage. First, what kinds of concept
classes have the locally focusing property? Second, given the existence of the locally
focusing property, how easy is it to compute the $\epsilon$-region of interest $\mathcal{R}_\epsilon$ for such
concept classes? Further research on these questions is awaited.

Figure 3-23: Construction of a function satisfying Lemma 2.




Figure 3-24: An arbitrary data set for the case of functions with a bounded derivative.
The functions in $\mathcal{F}_D$ are constrained to lie in the parallelograms as shown. The slopes
of the lines making up the parallelogram are $d$ and $-d$, respectively.







Figure 3-25: A zoomed version of the $i$th interval.

Figure 3-26: Subdivision of the $i$th interval when a new data point is obtained.




Figure 3-27: A figure to help the visualization of Lemma 4. For the $x$ shown, the set
$\mathcal{F}_D$ is the set of all values which lie within the parallelogram corresponding to this $x$,
i.e., on the vertical line drawn at $x$ but within the parallelogram.

Figure 3-29: How CLA-2 chooses to sample its points. Vertical lines have been drawn
at the x values where the CLA queried the oracle for the corresponding function 
value. 

Figure 3-30: How CLA-2 chooses to sample its points. The solid line is a plot of
$|f'(x)|$ where $f$ is Function-1 of our simulation set. The dotted line shows the density
of sample points (queried by CLA-2) on the domain.

Chapter 4



Language Learning Problems in the 
Principles and Parameters Framework 

Abstract 

This chapter considers a learning problem in which the hypothesis class is a class of parameterized 
grammars. After a brief introduction to the "principles and parameters" framework of modern 
linguistic theory, we consider a specific learning problem previously analyzed in a seminal work 
by Gibson and Wexler (1994). With our informational-complexity point of view developed in this 
thesis, we reanalyze their learning problem. This puts particular emphasis on the sample complexity 
of learning, in contrast to previous research in the inductive inference, or Gold frameworks (see 
Osherson and Weinstein, 1986). We show how to formally characterize this problem in particular, and 
a class of learning problems in finite parameter spaces in general, as a Markov structure. Important 
new language learning results follow directly: we explicitly compute sample complexity bounds under 
different distributional assumptions, learning regimes, and grammatical parameterizations. Briefly, 
we may view this as a precise way to model the "poverty of stimulus" children face in language 
acquisition. Our reanalysis alters several conclusions made by Gibson and Wexler. We therefore 
consider this chapter as a useful application of learning-theoretic notions to natural languages, and 
their acquisition. Finally, we describe several directions for further research. 

In Chapters 2 and 3, we considered the problem of learning target functions
(belonging to certain classes) from examples. Particular emphasis was given to the
sample complexity of learning such functions, and we have seen how it depends upon
the complexity of the hypothesis classes concerned. The classes of functions we have
investigated have, arguably, very little cognitive relevance. However, the investigations
have helped us to develop a point of view crucial to the analysis of learning
systems: a point of view which allows us to appreciate the inherent tension between
the approximation error and the estimation error in learning from examples. In
particular, we have seen how the hypothesis classes used by the learner must be large
to reduce the approximation error, and small to reduce the estimation error. In the
rest of the thesis (Chapters 4 and 5), we remedy our cognitive irrelevance by considering
some classes of functions which linguists and cognitive scientists believe the
brain must compute. As we shall soon see, there is a learning-theoretic argument
at the heart of the modern approach to linguistics; hence our choice of linguistic
structures for analysis. The origin of the research presented in this chapter lies in the
paper "Triggers" (Gibson and Wexler, 1994; henceforth GW), which marks a seminal
attempt to formally investigate language learning within the "principles and parameters"
framework (Chomsky, 1981). The results presented in this chapter emerged out
of a reanalysis of "Triggers" using more sophisticated mathematical techniques than
had previously been used in this context. One can thus regard this as a demonstration
of how our information-theoretic point of view, and the arguments and tools of
current learning theory, can help us to sharpen certain important questions and lead
to insightful analysis of relevant linguistic theories.

In the next section, we provide a brief account of the learning-theoretic considera- 
tions inherent in the modern approach to linguistics. We then give a brief account of 
the principles and parameters framework, and the issues involved in learning within 
this framework. This sets the stage for our investigations, and we use as a starting
point the Triggering Learning Algorithm (TLA) working on a three-parameter
syntactic subsystem first analyzed by Gibson and Wexler. The rest of the chapter
analyzes the TLA from the perspective of learnability and sample complexity. Issues
pertaining to parameter learning in general, and the TLA in particular, are discussed
at appropriate points. Finally, we suggest various directions for further research;
this chapter marks only the opening of our research on this theme. Very little work
has been done on the formal, computational aspects of parameter setting, and we
attempt here to pose questions which we think are of importance in the field.

4.1 Language Learning and the Poverty of Stimulus

The inherent tension between having large hypothesis classes, for greater expressive 
power, and small ones, for better learnability, is beautifully instantiated in the human 
language system. Humans develop a mature knowledge of language that is both rich
and subtle, on exposure to a fairly limited number (the so-called "poverty of stimulus")
of example sentences spoken by parents and guardians in childhood. Languages are
infinite sets of sentences (see footnote 18). Yet on exposure to a finite number of them (during the



18 There are an infinite number of sentences in the English language. You haven't heard all of 
them, yet you can judge the grammaticality of sentences you have not heard before. In the view of 
many linguists, you have internalized a grammar: a set of rules, a theory, or schema, by means of
which you are able to generalize to unseen sentences (examples). 
language acquisition phase in childhood) children correctly generalize to the infinite
set. Further, they generalize in exactly the same way: too striking a coincidence to 
be attributed to chance. This motivated Chomsky (1965) to argue that children must 
operate with constrained hypotheses about language — constraints which restrict the 
sorts of generalizations that they can make. These constrained hypothesis classes 
which children operate with, in the language context, are classes of grammars. Children
choose one particular grammar (see footnote 19) from this class, on the basis of the examples
they have seen. Thus, a child born in a Spanish speaking environment would choose 
the grammar which appropriately describes the data it has seen (Spanish sentences), 
and, similarly, a child born in a Chinese speaking environment chooses a different 
grammar, and so on. Of course, children might make mistakes, and they do. These 
mistakes are often resolved as more data becomes available to the child. Sometimes
these mistakes might never be resolved (when this happens is undoubtedly of great
interest), a possibility which we explore in the next chapter.

Thus, we see that if we were totally unconstrained in the kinds of hypotheses we
could make, then, on the basis of a finite data set, we would all generalize in wildly
different ways, implying that we would never be able to learn languages.
Yet we learn languages, apparently with effortless ease, as children. This realization is
crucial to linguistics. Humans, thus, are predisposed to choose certain generalizations
over others; they are predisposed to choose hypotheses belonging to a constrained
class of grammars. This predisposition is the essence of the innatist view of language;
the universal constraints on the class of grammars belong to universal grammar.
Furthermore, such a class of grammars must be large enough to capture the richness 
of language, yet small enough to be learned — exemplifying the tension discussed 
previously. The thrust thus shifted to finding the right constraints incorporated 
in such a class of grammars, in other words, finding the class of grammars of the 
right complexity. Notice, here, the similarity in spirit to the problem of finding a 
regularization network of the right complexity. Consequently, we see that an analysis 



19 It should be pointed out that there are various components of a language. There is its syntax, 
that concerns itself with syntactic units like verbs, noun phrases, etc. and their appropriate com- 
binations. Further, there is its phonology that deals with its sound structure, its morphology that 
deals with word structure, and finally, the vocabulary or "words" which are the building blocks out 
of which sentences are ultimately composed. Acquisition of a language involves the acquisition of 
all of this. We have been using the term grammar in a loose sort of way — it is a system of rules and 
principles which govern the production of acceptable sentences of the language. The grammar too 
could be broken into its syntactic parts, its phonological parts and so on. Some readers, recalling 
vivid memories of stuffy English school teachers, might have a natural resistance to the idea of rigid 
rules of grammaticality. For such people, we note, that while there is undoubtedly greater flexibility 
in word order than such teachers would suggest, it is a fact, that no one speaks "word salad" — with 
absolutely no attention to word order combinations at all. 
of the complexity of language learning coupled with a computational view of the
language acquisition device is crucial to the theoretical underpinnings of modern 
linguistics (see Wexler and Culicover (1980) for an excellent formal exposition of this 
idea). 

4.2 Constrained Grammars: Principles and Parameters

Having recognized the need for constraints on the class of grammars (this can be
regarded as an attempt to build a hypothesis class with finite learnability dimension; see footnote 20),
researchers have investigated several possible ways of incorporating such constraints
in the classes of grammars to describe the natural languages of the world. Examples
of this range from linguistically motivated grammars such as Head-driven Phrase
Structure Grammars (HPSG), Lexical-Functional grammars, and Optimality theory for
phonological systems, to bigrams, trigrams and connectionist schemes suggested from
an engineering consideration of the design of spoken language systems. Note that
every such grammar suggests a very specific model for human language, with its own 
constraints and its own complexity. Model-free, unconstrained, tabula rasa learning 
schemes correspond to hypothesis classes with infinite dimension, and these can never 
be learned in finite time. An important program of research consists of computing 
the sample complexity of learning each of these diverse classes of grammars. 

In this chapter, we conduct our investigations within the purview of the principles 
and parameters framework (Chomsky, 1981). Such a framework attempts to capture 
the "universal" principles common to all the natural languages of the world (part of 
our biological endowment as human beings possessed of the unique language faculty) 
and the parameters of variation across languages of the world. Roughly speaking, 
there are a finite number of principles governing the production of human languages. 
Each of these abstract principles can take one of several (finitely many) specific forms. 
The specific form manifests itself as a rule peculiar to a particular language (or class of 
languages). The specific form that such an abstract principle takes is governed 
by setting an associated parameter to one of several values. In typical versions of 
theories constructed within such a framework, one ends up with a parameterized 



20 In previous chapters, we have utilized the notions of VC-dimension and pseudo-dimension to 
characterize the complexity of learning real-valued function classes. It is not immediately clear 
what complexity measure should be used for characterizing classes of grammars; the development 
of a suitable measure, in tune with the demands of the language acquisition process, is an open 
question. 
class of grammars. The parameters are boolean valued: setting them to one set of 
values defines the grammar of German (say); setting them to another set of values 
defines the grammar, perhaps, of Chinese. Specific examples of theories within such 
a framework could include Government and Binding, Head-driven Phrase Structure 
Grammar, Optimality Theory, varieties of lexical-functional grammars and so forth. 
The idea is best illustrated in the form of examples. We now provide two examples, 
drawn from syntax and phonology, respectively. 

4.2.1 Example: A 3-parameter System from Syntax 

Two X-bar parameters: A classic example of a parametric grammar for syntax 
comes from X-bar theory (Chomsky, 1981; Haegeman, 1991). This describes a param- 
eterized phrase structure grammar, which defines the production rules for phrases, 
and ultimately sentences in the language. The general format for phrase structure is 
summarized by the following parameterized production rules: 

$$XP \to \text{Spec}\ X' \ (p_1 = 0) \quad \text{or} \quad X'\ \text{Spec} \ (p_1 = 1)$$

$$X' \to \text{Comp}\ X' \ (p_2 = 0) \quad \text{or} \quad X'\ \text{Comp} \ (p_2 = 1)$$

$$X' \to X$$

XP refers to an X-phrase, where X, or the "head", is a lexical category like N 
(Noun), V (Verb), A (Adjective), P (Preposition), and so on. Thus, one could generate 
NP, or Noun Phrases, VP, or Verb Phrases, and other phrases in this fashion. 
Spec refers to the specifier, in other words, that part of the phrase that "specifies" it, 
roughly like "the old" in "the old book". Comp refers to the complement, roughly a phrase's 
arguments, like "an ice-cream" in the Verb Phrase "ate an ice-cream", or "with envy" in the 
Adjective Phrase "green with envy". Both Spec and Comp can themselves be phrases 
with their own specifiers and complements. Furthermore, in a particular phrase, the 
spec-position or the comp-position might be blank (in these cases, Spec $\to \emptyset$ or 
Comp $\to \emptyset$, respectively). Applying these rules recursively, one can thus generate 
embedded phrases of arbitrary length in the language. Further, these rules are 
parameterized. Languages can be spec-first ($p_1 = 0$) or spec-final ($p_1 = 1$). Similarly, 
they can be comp-first ($p_2 = 0$) or comp-final ($p_2 = 1$). For example, the parameter settings of English 
are (spec-first, comp-final). Shown in fig. 4-33 is an embedded phrase which demonstrates 
the use of the X-bar production rules (with the English parameter settings) 
to generate an arbitrary English phrase. 

In contrast, the parameter settings of Bengali are (spec-first, comp-first). The 



[Tree diagram not reproduced. Rules used: XP → Spec X', X' → X' Comp; labeled nodes include VP, PP (Comp), and NP (Comp).]

Figure 4-33: Analysis of an English sentence. The parameter settings for English are 
spec-first, and comp-final. 

translation of the same sentence is provided in fig. 4-34. Notice how a difference in 
the comp-parameter setting causes a difference in word order. It is claimed that as far 
as basic, underlying word order is concerned, X-bar theory covers all the possibilities 
for natural languages 21 . Languages of the world simply differ in their parameter 
settings. 
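
The effect of the two X-bar parameters on word order can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Phrase` class, the `linearize` function, and the example phrase (built from the English glosses of the Bengali words in fig. 4-34) are hypothetical names introduced here, and the sketch ignores empty specifiers, transformations, and everything else in the theory.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Phrase:
    """An XP: a head X, an optional specifier, and zero or more complements."""
    head: str
    spec: Optional["Phrase"] = None
    # Complements are listed innermost first (closest to the head).
    comps: List["Phrase"] = field(default_factory=list)


def linearize(xp: Phrase, p1: int, p2: int) -> List[str]:
    """Return the surface word order of xp under parameters (p1, p2)."""
    words = [xp.head]                        # X' -> X
    for comp in xp.comps:                    # one X' layer per complement
        c = linearize(comp, p1, p2)
        # X' -> Comp X' (p2 = 0) or X' Comp (p2 = 1)
        words = c + words if p2 == 0 else words + c
    if xp.spec is not None:
        s = linearize(xp.spec, p1, p2)
        # XP -> Spec X' (p1 = 0) or X' Spec (p1 = 1)
        words = s + words if p1 == 0 else words + s
    return words


# Hypothetical example phrase (an assumption of this sketch).
vp = Phrase("ran", comps=[
    Phrase("from", comps=[Phrase("there")]),
    Phrase("with", comps=[Phrase("money", spec=Phrase("his"))]),
])

english_order = " ".join(linearize(vp, p1=0, p2=1))  # spec-first, comp-final
bengali_order = " ".join(linearize(vp, p1=0, p2=0))  # spec-first, comp-first
```

Under these assumptions the same phrase structure yields "ran from there with his money" with the English settings and "his money with there from ran" with the Bengali settings, which mirrors the gloss order of the Bengali words in the figure.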

One transformational parameter (V2): The two parameters described above de- 
fine generative rules to obtain basic word-order combinations permitted in the world's 
languages. As mentioned before, there are many other aspects which govern the for- 
mation of sentences. For example, there are transformational rules which determine 
the production of surface word order from the underlying (base) word-order structure 
obtained from the production rules above. One such parameterized transformational 
rule that governs the movement of words within a sentence is associated with the 
V2 parameter. It is observed that in German and Dutch declarative sentences, the 
relative order of verbs and their complements seems to vary depending upon whether 
the clause in which they appear is a root clause or a subordinate clause. Consider the 



21 There are a variety of other formalisms developed to take care of finer details of sentence struc- 
ture. This has to do with case theory, movement, government, binding and so on. See Haegeman 
(1991). 
[Tree diagram not reproduced. Rules used: XP → Spec X', X' → Comp X'. The Bengali phrase, with glosses: or paisa niye (his money with), shekhan theke (there from), douralo (ran). Both PPs are complements (Comp) of the verb, each contains an NP complement, and the Spec positions are empty.]

Figure 4-34: Analysis of the Bengali translation of the English sentence of the earlier 
figure. The parameter settings for Bengali are spec-first, and comp-first. 
following German sentences: 

(1) ...dass (that) Karl das (the) Buch (book) kauft (buys). 
    ...that Karl buys the book. 

(2) ...Karl kauft das Buch. 
    ...Karl buys the book. 

This seems to present a complication in that from these sentences it is not clear 
whether German is comp-first (as example 1 seems to suggest) or comp-final (as 
example 2 seems to suggest). It is believed (Haegeman, 1991) that the underlying 
word-order form is comp-first (like Bengali, and unlike English, in this respect); however, 
the V2 parameter is set for German ($p_3 = 1$). This implies that finite verbs must 
appear in exactly the second position in root declarative clauses ($p_3 = 0$ would mean 
that this need not be the case). This is a specific application of the transformational 
rule Move-$\alpha$. For details and analysis, see Haegeman (1991). 

Each of these three parameters can take one of two values. There are, thus, 8 
possible grammars, and correspondingly 8 languages by extension, generated in this 
fashion. At this stage, the languages are defined over a vocabulary of syntactic 
categories, like N, V, etc. Applying the three parameterized rules, one would obtain 
different ways of combining these syntactic categories to obtain sentences. Appendix 
A is a list of the set of unembedded (degree-0) sentences obtained for each of the 
languages, $L_1$ through $L_8$, in this parametric system. The vocabulary has been modified 
so that sentences are now defined over more abstract units than syntactic categories. 

4.2.2 Example: Parameterized Metrical Stress in Phonology 

The previous example dealt with a parameterized family for syntax. As we mentioned 
before, syntax is only one component of language. Here we consider an example from 
phonology; in particular, our example deals with metrical stress which describes the 
possible ways in which words in a language can be stressed. 

Consider the English word "candidate". This is a three-syllable word, composed 
of the three syllables /can/, /di/, and /date/. A native speaker of American 
English typically pronounces this word by stressing the first syllable. 
Similarly, such a native speaker would also stress the first syllable of the tri-syllabic 
word "/al/-/pha/-/bet/", so that it almost rhymes with "candidate". In contrast, a 
French speaker would stress the final syllable of both these words, a contrast which 
is perceived as a "French" accent by the English ear. 

For simplicity, assume that stress has two levels, i.e., each syllable in each word 
can be either stressed, or unstressed 22 . Thus, an $n$-syllable long word could have, 
in principle, as many as $2^n$ different possible ways of being stressed. For a particular 
language, however, only one of these ways is phonologically well-formed. Other 
stress patterns sound accented, or awkward. Words could potentially be of arbitrary 
length 23 . Thus one could write phonological grammars: functional mappings from 
these words to their correct stress patterns. Clearly, this is another example of a 
functional mapping the brain must compute. Further, different languages correspond 
to different such functions, i.e., they correspond to different phonological grammars. 
Within the principles and parameters framework, these grammars are parameterized 
as well. 

Let us consider a simplified version of two principles, associated with 3 boolean 
valued parameters, which play a role in the Halle and Idsardi metrical stress system. 
These principles describe how a multisyllable word can be broken into its constituents 
(recall how sentences were composed of constituent phrases in syntax) before stress 
assignment takes place. This is done by a bracketing schema which places brackets 
at different points in the word, thereby marking (bracketing) off different sections as 
constituents. A constituent is then defined as a syllable sequence between consecutive 
brackets. In particular, a constituent must be bounded by a right bracket on its right 
edge, or a left bracket on its left edge (both these conditions need not be satisfied 
simultaneously). Further, it cannot have any brackets in the middle. Finally, note 
that not all syllables of the word need be part of a constituent. A sequence of 
syllables might not be bracketed by either an appropriate left or right bracket; 
such a sequence cannot have a stress-bearing head, and might be regarded as an 
extra-metrical sequence. 

1) the edge parameters: there are two such parameters. 

a) put a left ($p_1 = 0$) or right ($p_1 = 1$) bracket 

b) put the above-mentioned bracket exactly one syllable after the left ($p_2 = 0$) edge 
or before the right ($p_2 = 1$) edge of the word. 

2) the head parameter: each constituent (made up of one or more syllables) has a 


22 While we have not provided a formal definition of either stress, or syllable, it is hoped, that at 
some level, the concepts are intuitive to the reader. It should, however, be pointed out that linguists 
differ on their characterization of both these objects. For example, how many levels can stress have? 
Typically, (Halle and Idsardi, 1991) three levels are assumed. Similarly, syllables are classified into 
heavy and light syllables. We have discounted such niceties for ease of presentation. 

23 One shouldn't be misled by the fact that a particular language has only a finite number 
of words. When presented with a foreign word, or a "non-sense" word one hasn't heard before, one 
can still attempt to pronounce it. Thus, the system of stress assignment rules in our native language 
probably dictates the manner in which we choose to pronounce it. Speakers of different languages 
would accent these non-sense words differently. 






"head". This is the stress-bearing syllable of the constituent, and is, in some sense, 
the primary, or most important, syllable of that constituent (recall how syntactic 
constituents, the phrases, had a lexical head). This phonological head could be the 
leftmost ($p_3 = 0$), or the rightmost ($p_3 = 1$), syllable in the constituent. 

Suppose the parameters are set to the following values: [$p_1 = 0$, $p_2 = 0$, 
$p_3 = 0$]. Fig. 4-35 shows how some multisyllable words would have stress assigned 
to them. In this case, any $n$-syllable word would have stress in exactly the second 
position (if such a position exists) and no other. In contrast, if [$p_1 = 0$, $p_2 = 0$, 
$p_3 = 1$], the corresponding language would stress the final syllable of all multi-syllable 
words. Monosyllabic words are unstressed in both languages. 



[Diagram: the bracketings x(xxxxx, x(xxxx and x( for words of six, five and one syllables, shown in two columns.]



Figure 4-35: Depiction of stress pattern assignment to words of different syllable 
length under the parameterized bracketing scheme described in the text. 
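
The bracketing scheme described in the text can be sketched as a small function. The code below is a hedged illustration, not the Halle and Idsardi system itself: the name `assign_stress` is hypothetical, and the treatment of parameter settings other than the two discussed in the text (in particular, what happens to very short words) is an assumption of this sketch.

```python
from typing import List


def assign_stress(n: int, p1: int, p2: int, p3: int) -> List[int]:
    """Return a 0/1 stress pattern for an n-syllable word (1 = stressed)."""
    if n < 1:
        raise ValueError("a word needs at least one syllable")
    # Edge parameter (b): the bracket sits one syllable in from the left
    # edge (p2 = 0) or one syllable in from the right edge (p2 = 1).
    k = 1 if p2 == 0 else n - 1
    # Edge parameter (a): a left bracket (p1 = 0) opens a constituent to its
    # right; a right bracket (p1 = 1) closes a constituent to its left.
    constituent = list(range(k, n)) if p1 == 0 else list(range(0, k))
    pattern = [0] * n
    if constituent:  # an empty constituent has no head, hence no stress
        # Head parameter: leftmost (p3 = 0) or rightmost (p3 = 1) syllable.
        head = constituent[0] if p3 == 0 else constituent[-1]
        pattern[head] = 1
    return pattern


second_syllable = assign_stress(5, 0, 0, 0)  # x(xxxx with a leftmost head
final_syllable = assign_stress(5, 0, 0, 1)   # x(xxxx with a rightmost head
monosyllable = assign_stress(1, 0, 0, 0)     # x( has an empty constituent
```

With [$p_1 = 0$, $p_2 = 0$, $p_3 = 0$] the sketch stresses the second syllable, with $p_3 = 1$ it stresses the final syllable, and a monosyllabic word stays unstressed in both cases, in line with the description above.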

These 3 parameters represent a very small (almost trivial) component of stress 
pattern assignment. There are many more parameters which describe metrical stress 
assignment more completely. At this level of analysis, for example, the 
language Koya has $p_3 = 0$, while Turkish has $p_3 = 1$; see Kenstowicz (1992) for more 
details. The point of this example was to provide a flavor of how the problem of 
stress-assignment can be described formally by a parametric family of functions. The 
analysis of parametric spaces developed in this chapter can be equally well applied to 
such stress systems. 

4.3 Learning in the Principles and Parameters 
Framework 

Language acquisition in the principles and parameters framework reduces to the 
setting of the parameters corresponding to the "target" language. A child is born in an 
arbitrary linguistic environment. It receives examples in the form of sentences it hears 
in its linguistic environment. On the basis of the example sentences it hears, it 
presumably learns to set the parameters appropriately. Thus, referring to our 3-parameter 
system for syntax, if the child is born in a German-speaking environment, and hears 
German sentences, it should learn to set the V2 parameter, and the spec-parameter 
to spec-first. Similarly, a child hearing English sentences should learn to set the 
comp-parameter to comp-final. In principle, the child is thus solving a parameter 
estimation problem: an unusual class of parameter estimation problems, no doubt, 
but in spirit, little different from the parameter estimation problem associated with 
the regularization networks of Chapter 2. One can thus ask a number of questions 
about such problems. What sort of data does the child need in order to set the target 
parameters? Is such data readily available to the child? How often is such data made 
available to the child? What sort of algorithms does the child use in order to set the 
parameters? How efficient are these algorithms? How much data does the child need? 
Will the child always converge to the target "in the limit"? 

Language acquisition, in the context of parameterized linguistic theories, thus, 
gives rise to a class of learning problems associated with finite parameter spaces. 
Furthermore, as emphasized particularly by Wexler in a series of works (Hamburger 
and Wexler, 1975; Culicover and Wexler, 1980; and Gibson and Wexler, 1994), the 
finite character of these hypothesis spaces does not solve the language acquisition 
problem. As Chomsky noted in Aspects of the Theory of Syntax (1965), the key point 
is how the space of possible grammars, even if finite, is "scattered" with respect to 
the primary language input data. It is logically possible for just two grammars (or 
languages) to be so near each other that they are not separable by psychologically 
realistic input data. This was the thrust of Wexler and Hamburger's, and Wexler and 
Culicover's, earlier work on the learnability of transformational grammars from simple 
data (with at most 2 embeddings). More recently, a significant analysis of specific 
parameterized theories has come from Gibson and Wexler (1994). They propose the 
Triggering Learning Algorithm, a simple, psychologically plausible algorithm which 
children might conceivably use to set parameters in finite parameter spaces. 
Investigating the performance of the TLA on the 3-parameter syntax subsystem shown 
in the example yields the surprising result that the TLA cannot achieve the target 
parameter setting for every possible target grammar in the system. Specifically, there 
are certain target parameter settings for which the TLA could get stuck in local 
maxima which it would never be able to leave, and consequently, learnability 
would never result. 

We are interested both in the learnability and in the sample complexity of the finite 
hypothesis classes suggested by the principles and parameters theory. An 
investigation of this sort requires us to define the important dimensions of the learning 
problem, that is, the issues which need to be systematically addressed. The following figure 
provides a schematic representation of the space of possibilities which need to be 
explored in order to completely understand and evaluate a parameterized linguistic 
theory from a learning perspective. The important dimensions are as follows: 

[Diagram: five axes labeled Parametrization, Distribution of Data, Noise, Memory Requirements, and Learning Algorithm.]



Figure 4-36: The space of possible learning problems associated with parameterized 
linguistic theories. Each axis represents an important dimension along which spe- 
cific learning problems might differ. Each point in this space specifies a particular 
learning problem. The entire space represents a class of learning problems which are 
interesting. 

1) the parameterization of the language space itself: a particular linguistic theory 
would give rise to a particular choice of universal principles, and associated 
parameters. Thus, one could vary, along this dimension of analysis, the parameterized 
hypothesis classes which need to be investigated. The parametric system for metrical 
stress (Example 2) is due to Halle and Idsardi. A variant, investigated by Dresher 
and Kaye (1990), can equally well be subjected to analysis. 

2) the distribution of the input data: once a parametric system is decided upon, 
one must then decide the distribution according to which data (i.e., sentences 
generated by some target grammar belonging to the parameterized family of grammars) is 
presented to the learner. Clearly, not all sentences occur with equal likelihood. Some 
are more likely than others. How does this affect learnability? How does this affect 
sample complexity? One could, of course, attempt to come up with 
distribution-independent bounds on the sample complexity. This, as we shall soon see, is not 
possible. 

3) the presence, and nature, of noise, or extraneous examples: in practice, children 
are exposed to noise (sentences, which are inconsistent with the target grammar) due 
to the presence of foreign, or idiosyncratic speakers, disfluencies in speech, or a variety 
of other reasons. How does one model noise? How does it affect sample complexity 
or learnability or both? 

4) the type of learning algorithm involved: a learning algorithm is an effective 
procedure mapping data to hypotheses (parameter values). Given that the brain has 
to solve this mapping problem, it then becomes of interest, to study the space of 
algorithms which can solve it. How many of them converge to the target? What is 
their sample complexity? Are they psychologically plausible? 

5) the use of memory: this is not really an independent dimension, in the sense, 
that it is related to the kind of algorithms used. The TLA, and variants, as we shall 
soon see, are memoryless algorithms. These can be modeled by a Markov chain. 

This is the space which needs to be explored. By making a specific choice along 
each of the five dimensions discussed (corresponding to a single point in the 
5-dimensional space of fig. 4-36), we arrive at a specific learning problem. Varying 
the choices along each dimension (thereby traversing the entire space of fig. 4-36) 
gives rise to the class of learning problems associated with parameterized linguistic 
theories. For our analysis, we choose as a concrete starting point the Gibson and 
Wexler Triggering Learning Algorithm (TLA) working on the 3-parameter syntactic 
subsystem in the example shown. In our space of language learning problems, this 
corresponds to (1) a 3-way parameterization, using mostly X-bar theory; (2) a 
uniform sentence distribution over unembedded (degree-0) sentences; (3) no noise; (4) a 
local gradient ascent search algorithm; and (5) memoryless (online) learning. 
Following our analysis of this learning system, we consider variations in learning algorithms, 
sentence distribution, noise, and language/grammar parameterizations. 

4.4 Formal Analysis of the Triggering Learning 
Algorithm 

Let us start with the TLA. We first show that this algorithm, and others like it, is 
completely modeled by a Markov chain. We explore the basic computational 
consequences of this fundamental result, including some surprising results about sample 
complexity and convergence time, the dominance of random walk over gradient 
ascent, and the applicability of these results to actual child language acquisition, and 
possibly language change. 

Background. Following Gold (1967), the basic framework is that of identification in 
the limit. We assume some familiarity with Gold's assumptions. The learner receives 
an (infinite) sequence of (positive) example sentences from some target language. 
After each, the learner either (i) stays in the same state; or (ii) moves to a new state 
(changes its parameter settings). If after some finite number of examples the learner 
converges to the correct target language and never changes its guess, then it has 
correctly identified the target language in the limit; otherwise, it fails. 

In the GW model (and others) the learner obeys two additional fundamental 
constraints: (1) the single-value constraint, under which the learner can change only 1 parameter 
value at each step; and (2) the greediness constraint, under which, if the learner is given a positive 
example it cannot recognize and changes one parameter value, finding that it can 
accept the example, then the learner retains that new value. The TLA can then be 
precisely stated as follows. See Gibson and Wexler (1994) for further details. 

• [Initialize] Step 1. Start at some random point in the (finite) space of possible 
parameter settings, specifying a single hypothesized grammar with its resulting 
extension as a language; 

• [Process input sentence] Step 2. Receive a positive example sentence $s_i$ at time 
$t_i$ (examples drawn from the language of a single target grammar, $L(G_t)$), from 
a uniform distribution on the degree-0 sentences of the language (we shall be 
able to relax this distributional constraint later on); 

• [Learnability on error detection] Step 3. If the current grammar parses (generates) 
$s_i$, then go to Step 2; otherwise, continue; 

• [Single-step gradient-ascent] Step 4. Select a single parameter at random, uniformly 
with probability $1/n$, to flip from its current setting, and change it (0 mapped 
to 1, 1 to 0) iff that change allows the current sentence to be analyzed; then 
return to Step 2. 
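
The steps above can be sketched in Python. This is a minimal illustration under loud assumptions: the two-parameter `LANGUAGES` table is a hypothetical toy system invented for this sketch (it is not GW's 3-parameter data), languages are represented extensionally as finite sets of strings so that "parsing" becomes set membership, and the function names are made up here.

```python
import random
from typing import Dict, FrozenSet, Tuple

State = Tuple[int, ...]

# Hypothetical toy system (an assumption of this sketch, not GW's data):
# two binary parameters, each grammar's language given as a set of strings.
LANGUAGES: Dict[State, FrozenSet[str]] = {
    (0, 0): frozenset({"a", "b"}),
    (0, 1): frozenset({"a", "c"}),
    (1, 0): frozenset({"b", "d"}),
    (1, 1): frozenset({"c", "d"}),
}


def tla_step(state: State, sentence: str, rng: random.Random) -> State:
    """Apply Steps 3 and 4 of the TLA to a single input sentence."""
    if sentence in LANGUAGES[state]:       # Step 3: sentence parses, no change
        return state
    k = rng.randrange(len(state))          # Step 4: pick one parameter, prob. 1/n
    flipped = state[:k] + (1 - state[k],) + state[k + 1:]
    # Greediness: keep the flip only if it lets the sentence be analyzed.
    return flipped if sentence in LANGUAGES[flipped] else state


def run_tla(target: State, steps: int, seed: int = 0) -> State:
    """Run the TLA for a fixed number of examples and return the final state."""
    rng = random.Random(seed)
    state = rng.choice(sorted(LANGUAGES))  # Step 1: random initial grammar
    sentences = sorted(LANGUAGES[target])
    for _ in range(steps):
        s = rng.choice(sentences)          # Step 2: uniform positive example
        state = tla_step(state, s, rng)
    return state
```

In this toy system every non-target state has some sentence that moves the learner closer to the target, so `run_tla((1, 1), steps=500)` should end in the target state for essentially any seed. A system with local maxima would behave differently: the learner could settle in a non-target state and never leave it.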

Of course, this algorithm never halts in the usual sense. GW aim to show under 
what conditions this algorithm converges "in the limit", that is, after some number 
of steps, unknown in advance, the correct target parameter settings will be selected 
and never be changed. They investigate the behavior of the TLA on the linguistically 
natural 3-parameter syntactic subsystem of example 1. Note that a grammar in this 
space is simply a particular $n$-length array of 0's and 1's; hence there are $2^n$ possible 
grammars (languages). Gibson and Wexler's surprising result is that the simple 
3-parameter space they consider is unlearnable in the sense that positive-only examples 
can lead to local maxima: incorrect hypotheses from which a learner can never escape. 
More broadly, they show that learnability in such spaces is still an interesting problem, 
in that there is a substantive learning theory concerning feasibility, convergence time, 
and the like, that must be addressed beyond traditional linguistic theory and that 
might even choose between otherwise adequate linguistic theories. 
Triggers: Various researchers (Lightfoot, 1991; Clark and Roberts, 1993; Gibson 
and Wexler, 1994; Frank and Kapur, 1992) have explored the notion of triggers as a 
way to model parameter space language learning. Intuitively, triggers are supposed to 
represent evidence which allows the child to set the parameter for the target language. 
Concretely, Gibson and Wexler define triggers to be sentences from the target which 
allow a parameter to be correctly set. Thus, global triggers for a particular parameter 
are sentences from the target language which force the learner to set that parameter 
correctly (irrespective of the learner's current hypothesis about the target parameter 
settings). On the other hand, local triggers for a particular parameter depend upon 
the learner's hypothesis. Given values for all parameters but one (the parameter in 
question), local triggers are sentences which force the learner to correctly set the value 
of that parameter. 

Gibson and Wexler suggest that the existence of local triggers for every (hypoth- 
esis, target) pair in the space suffices for TLA learnability to hold. As we shall see 
later, one important corollary of our stochastic formulation shows that this condition 
does not suffice. In other words, even if a triggered path exists from the learner's hy- 
pothesis language to the target, the learner might, with high probability, not take this 
path, resulting in non-learnability. A further consequence is that many of Gibson and 
Wexler's proposed cures for nonlearnability in their example system, such as 
"maturational" ordering imposed on parameter settings, simply do not apply. On the other 
hand, this result reinforces Gibson and Wexler's basic point that seemingly simple 
parameter-based language learning models can be quite subtle, so subtle that even a 
superficially complete computer simulation can fail to uncover learnability problems. 

4.4.1 The Markov formulation 

From the standpoint of learning theory, GW leave open several questions that can be 
addressed by a more precise formalization of this model in terms of Markov chains (a 
possible formalization suggested but left unpursued in footnote 9 of GW). 

Consider a parameterized grammar (language) family with $n$ parameters. We can 
picture the hypothesis space, of size $2^n$, as a set of points, each corresponding to 
one particular array of parameter settings (languages, grammars). Call each point 
a hypothesis state or simply state of this space. As is conventional, we define these 
languages over some alphabet $\Sigma$ as a subset of $\Sigma^*$. One of them is the target language 
(grammar). We arbitrarily place the (single) target grammar at the center of this 
space. Since by the TLA the learner is restricted to changing at most 1 binary value 
in a single step, the theoretically possible transitions between states can be drawn 
as (directed) lines connecting parameter arrays (hypotheses) that differ by at most 1 
binary digit (a 0 or a 1 in some corresponding position in their arrays). Recall that 
the number of positions in which two arrays differ is the so-called Hamming distance. 

We may further place weights on the transitions from state $i$ to state $j$. These 
correspond to the probabilities that the learner will move from hypothesis state $i$ to 
state $j$. In fact, as we shall show below, given a distribution over $L(G_t)$, we can 
further carry out the calculation of the actual transition probabilities themselves. 
Thus, we can picture the TLA learning space as a directed, labeled graph $V$ with $2^n$ 
vertices. 24 More precisely, we can make the following remarks about the TLA system 
GW describe. 

Remark. The TLA system is memoryless, that is, given a sequence $s$ of sentences up 
to time $t_i$, the selection of hypothesis $h(t_{i+1})$ depends only on the current hypothesis 
$h(t_i)$ and sentence $s(t_i)$, and not (directly) on previous sentences, i.e., 

$$P\{h(t_{i+1}) = h \mid h(t), s(t),\ t \le t_i\} = P\{h(t_{i+1}) = h \mid h(t_i), s(t_i)\}$$

In other words, the TLA system is a classical discrete stochastic process, in 
particular, a discrete Markov process or Markov chain. We can now use the theory 
of Markov chains to describe TLA parameter spaces (Isaacson and Madsen, 1976). 
For example, as is well known, we can convert the graphical representation of an 
$n$-state Markov chain $M$ to an $n \times n$ matrix $T$, where each matrix entry $(i, j)$ 
represents the transition probability from state $i$ to state $j$. A single step of the 
Markov process is governed by $T$ itself; two steps are computed via the matrix 
multiplication $T \times T$, and $n$ steps are given by $T^n$. A "1" entry in any cell $(i, j)$, in the 
limit of $T^n$ as $n \to \infty$, means that the system will converge with probability 
1 to state $j$, given that it starts in state $i$. 
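
As a rough illustration of the matrix view, the sketch below builds a transition matrix for the TLA and raises it to a power. Several things here are assumptions of the sketch rather than results from the text: the transition rule is read directly off the TLA steps (a sentence drawn uniformly from the target language, one of $n$ parameters chosen uniformly, the flip kept only if the new grammar accepts the sentence), the two-parameter `TOY_LANGUAGES` table is hypothetical, and the helper names are invented.

```python
from fractions import Fraction
from typing import Dict, FrozenSet, List, Tuple

State = Tuple[int, ...]

# Hypothetical toy system (an assumption of this sketch, not GW's data).
TOY_LANGUAGES: Dict[State, FrozenSet[str]] = {
    (0, 0): frozenset({"a", "b"}),
    (0, 1): frozenset({"a", "c"}),
    (1, 0): frozenset({"b", "d"}),
    (1, 1): frozenset({"c", "d"}),
}


def transition_matrix(languages: Dict[State, FrozenSet[str]], target: State):
    """Build T for the TLA, assuming uniformly distributed target sentences."""
    n = len(target)
    states = sorted(languages)
    index = {s: i for i, s in enumerate(states)}
    sentences = languages[target]
    p_sentence = Fraction(1, len(sentences))
    T = [[Fraction(0)] * len(states) for _ in states]
    for s in states:
        row = T[index[s]]
        for k in range(n):                     # neighbor at Hamming distance 1
            nb = s[:k] + (1 - s[k],) + s[k + 1:]
            # A move needs a sentence that s rejects but the neighbor accepts.
            movers = [w for w in sentences
                      if w not in languages[s] and w in languages[nb]]
            row[index[nb]] = len(movers) * p_sentence * Fraction(1, n)
        row[index[s]] = 1 - sum(row)           # otherwise the learner stays put
    return states, T


def mat_mul(A: List[List[Fraction]], B: List[List[Fraction]]):
    return [[sum(a * b for a, b in zip(r, c)) for c in zip(*B)] for r in A]


def mat_pow(T: List[List[Fraction]], k: int):
    size = len(T)
    result = [[Fraction(int(i == j)) for j in range(size)] for i in range(size)]
    for _ in range(k):
        result = mat_mul(result, T)
    return result


states, T = transition_matrix(TOY_LANGUAGES, target=(1, 1))
T_200 = mat_pow(T, 200)   # 200-step transition probabilities
```

In this toy system the row for the target state is all zeros except for a 1 on the diagonal, and the target column of `T_200` is very close to 1 for every starting state, which is the matrix counterpart of convergence to the target.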

As mentioned, not all these transitions will be possible in general. For example, 
by the single value hypothesis, the system can only move 1 Hamming bit at a time. 
Also, by assumption, only differences in surface strings can force the learner from one 
hypothesis state to another. For instance, if state i corresponds to a grammar that 



24 GW construct an identical transition diagram in the description of their computer program for 
calculating local maxima. However, this diagram is not explicitly presented as a Markov structure; it 
does not include transition probabilities. Of course, topologically both structures must be identical. 
generates a language that is a proper subset of that of another grammar hypothesis $j$, there 
can never be a transition from $j$ to $i$, and there must be one from $i$ to $j$. Further, 
by assumption and the TLA, it is clear that once we reach the target grammar there 
is nothing that can move the learner from this state, since all remaining positive 
evidence will not cause the learner to change its hypothesis. Thus, there must be a 
loop from the target state to itself and no exit arcs. In the Markov chain literature, 
this is known as an Absorbing State (AS). Obviously, a state that only leads to an AS 
will also drive the learner to that AS. Finally, if a state corresponds to a grammar 
that generates some sentences of the target, there is always a loop from that state to 
itself that has some nonzero probability. 
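
The notions of an absorbing state and of states that can still reach the target can be checked mechanically on any transition matrix. The sketch below is illustrative only: the 4-state matrix `T_EXAMPLE` is a made-up example (not the GW system) in which state 3 plays the role of the target and state 0 is a second absorbing state, and the function names are hypothetical.

```python
from typing import List, Set


def absorbing_states(T: List[List[float]]) -> List[int]:
    """Indices i with T[i][i] == 1: a self-loop and no exit arcs."""
    return [i for i, row in enumerate(T) if row[i] == 1]


def can_reach(T: List[List[float]], goal: int) -> Set[int]:
    """States from which `goal` is reachable along nonzero-probability arcs."""
    reached = {goal}
    changed = True
    while changed:
        changed = False
        for i, row in enumerate(T):
            if i not in reached and any(row[j] > 0 for j in reached):
                reached.add(i)
                changed = True
    return reached


# Hypothetical 4-state chain (an assumption of this sketch).
T_EXAMPLE = [
    [1.0, 0.0, 0.0, 0.0],    # state 0: absorbing, but not the target
    [0.5, 0.5, 0.0, 0.0],    # state 1: can only stay or fall into state 0
    [0.0, 0.0, 0.75, 0.25],  # state 2: eventually moves to the target
    [0.0, 0.0, 0.0, 1.0],    # state 3: the target, absorbing
]

absorbing = absorbing_states(T_EXAMPLE)
reaches_target = can_reach(T_EXAMPLE, goal=3)
```

Here `absorbing` contains both state 0 and state 3, and `reaches_target` contains only states 2 and 3. A learner that starts in state 0 or state 1 of this made-up chain can never arrive at the target, which is the kind of situation the text describes as a local maximum.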
Example. 

Consider the 3-parameter syntax subsystem of Example 1. Its binary parameters are: 
(1) Spec(ifier) first (0) or last (1); (2) Comp(lement) first (0) or last (1); and (3) Verb 
Second (V2) does not exist (0) or does exist (1). As discussed in the example, the 
3 parameters give rise to 8 distinct grammars. Further, these grammars generate 
different combinations of syntactic categories. 

Rather than considering categories of the form Noun, Adjective, and so on, one 
could use more abstract constituents to define the vocabulary of the language. One 
possible approach is to allow the usage of phrases as possible "words" in the language. 
This is what GW choose to do. The net result is that the grammars are now defined 
over a vocabulary, $\Sigma = \{S, V, O, O1, O2, Adv, Aux\}$, corresponding to Subject, Verb, 
Object, Direct Object, Indirect Object, Adverb, and Auxiliary verb. See Haegeman 
(1991) for an account of such a transformation. Sentences in $\Sigma^*$ now correspond to 
concatenations of these basic "words", which are really phrases. 

For instance, parameter setting (5) corresponds to the array [0 1 0] = Specifier first, 
Comp last, and -V2, which works out to the possible basic English surface phrase 
order of Subject-Verb-Object (SVO). As shown in GW's figure (3), the other possible 
arrangements of surface strings corresponding to this parameter setting include S V 
(as in "John runs"); S V O1 O2 (two objects, as in "give John an ice-cream"); S Aux 
V (as in "John is running"); S Aux V O; S Aux V O1 O2; Adv S V (where Adv is 
an Adverb, like "quickly"); Adv S V O; Adv S V O1 O2; Adv S Aux V; Adv S Aux 
V O; and Adv S Aux V O1 O2. Shown in appendix A of this chapter are all the 
possible degree-0 (unembedded) sentences generated by the 8 possible grammars of 
this parametric system. 

Suppose SVO (setting #5 = [0 1 0]) is the target grammar (language). With the 
GW 3-parameter system, there are $2^3 = 8$ possible hypotheses, so we can draw this 
as an 8-point Markov configuration space, as shown in fig. 4-37. The shaded rings 
represent increasing Hamming distances from the target. Each labeled circle is a 
Markov state, a possible array of parameter settings or grammar, hence extensionally 
specifies a possible target language. Each state is exactly 1 binary digit away from 
its possible transition neighbors. Each directed arc between the points is a possible 
(nonzero) transition from state i to state j; we shall show how to compute this 
immediately below. We assume that the target grammar, a double circle, lies at the 
center. This corresponds to the (English) SVO language. Surrounding the bull's-eye 
target are the 3 other parameter arrays that differ from [0 1 0] by one binary 
digit each; we picture these as a ring 1 Hamming bit away from the target: [0 1 1], 
corresponding to GW's parameter setting #6 in their figure 3 (Spec-first, Comp-final, 
+V2, basic order SVO+V2); [0 0 0], corresponding to GW's setting #7 (Spec-first, 
Comp-first, -V2), basic order SOV; and [1 1 0], GW's setting #1 (Spec-final, 
Comp-final, -V2), basic order VOS. 

Around this inner ring lie 3 parameter setting hypotheses, all 2 binary digits away 
from the target: [0 0 1], [1 0 0], and [1 1 1] (grammars #2, 3, and 8 in GW figure 
3). Note that by the Single Value hypothesis, the learner can only move one grey 
ring towards or away from the target at any one step. Finally, one more ring out, 
three binary digits different from the target, is the hypothesis [1 0 1], corresponding 
to grammar 4. 
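
The ring structure described above can be checked mechanically. The following is a minimal sketch, assuming each parameter setting is represented as a tuple of three bits; the names `TARGET`, `hamming_rings`, and `RINGS` are illustrative and do not come from GW.

```python
from itertools import product

TARGET = (0, 1, 0)  # setting #5: Spec-first, Comp-final, -V2

def hamming_rings(target):
    """Group all binary parameter settings by Hamming distance from `target`."""
    rings = {}
    for setting in product((0, 1), repeat=len(target)):
        distance = sum(a != b for a, b in zip(setting, target))
        rings.setdefault(distance, []).append(setting)
    return rings

RINGS = hamming_rings(TARGET)
for distance in sorted(RINGS):
    # Ring 0 is the target itself; rings 1, 2 and 3 hold 3, 3 and 1 settings.
    print(distance, RINGS[distance])
```

Under the Single Value constraint a learner can only move between adjacent rings, since flipping one parameter changes the Hamming distance to the target by exactly one.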

Using this picture, we can also now readily interpret some of the terminological 
notions in GW's article. A local trigger is simply a datum that would allow the 
learner to move along an ingoing link in the figure. This is because an ingoing link is 
associated with sentences which allow the learner to move 1 bit closer to the target 
in parameter space, and consequently, set one parameter correctly. For example, the 
link from grammar state 3 to grammar state 7 does correspond to a local trigger, 
as does the link from 4 to 2; however, the link from grammar 3 to 4 is not a local 
trigger. Also, because of the Single Value and Greediness constraints, the learner can 
only either (i) stay in its current state; (ii) move 1 step inwards (a local trigger); or 
(iii) move 1 step outwards (note that this also happens given data from the target, 
just as in Case (ii)). These are the only allowed moves; one cannot move to another 
state within the same ring. 

One can also describe the learnability properties of this space more formally. 
In this Markov chain, certain states have no outgoing arcs; these are among the 
Absorbing States (AS) because once the system has made a transition into one of 
these states, it can never exit. More generally, let us define a set of closed states 
CS to be any proper subset of states in the Markov chain such that there is no arc 
from any of the states in CS to any state outside CS. 

Figure 4-37: The 8 parameter settings in the GW example, shown as a Markov structure. 
Directed arrows between circles (states, parameter settings, grammars) represent possible 
nonzero (possible learner) transitions. The target grammar (in this case, number 5, setting 
[0 1 0]), lies at dead center. Around it are the three settings that differ from the target 
by exactly one binary digit; surrounding those are the 3 hypotheses two binary digits away 
from the target; the third ring out contains the single hypothesis that differs from the target 
by 3 binary digits. Note that the learner can either cycle or step in or out one ring (binary 
digit) at a time, according to the single-step learning hypothesis; but some transitions are 
not possible because there is no data to drive the learner from one state to the other under 
the TLA. Numbers on the arcs denote transition probabilities between grammar states; 
these values are not computed by the original GW algorithm. 

Note that in the systems under discussion the target state is always an Absorbing 
State (once the learner is at the target grammar, it can never exit), so the Markov 
chains we will consider always have at least one AS. In the example 3-parameter system, 
state 2 is also an Absorbing State. Given this formulation, one can immediately 
give a very simple criterion for the learnability of such parameter spaces operated 
upon by the TLA (see footnote 25). 

Theorem 4.4.1 Given a Markov chain C corresponding to a parameter space, a 
target parameter setting, and a GW TLA learner that attempts to learn the target 
parameters, there exists exactly 1 AS (corresponding to the target grammar) and every CS 
includes the target state iff target parameters can be correctly set by the TLA in the 
limit (with probability 1). 

Proof. ($\Leftarrow$) By assumption, C is learnable. Now assume for sake of contradiction 
that there is some CS that does not include the target state. If the learner starts in 
some state belonging to this CS, it can never reach the target AS, by the definition 
of a closed set of states. This contradicts the assumption that the space was learnable. 

($\Rightarrow$) Assume that there exists exactly 1 AS in the Markov chain C and no closed 
set of states CS that does not include the target. There are two cases. Case (i): at some time 
the learner reaches the target state. Then, by definition, the learner has converged 
and the system is learnable. Case (ii): there is no time at which the learner reaches the 
target state. Then the learner must move forever among a set of nontarget states. But 
this by definition forms a closed set of states distinct from the target, a contradiction. 
The argument can be made more rigorous by taking a canonical decomposition of the 
chain C into equivalence classes of states, noting that the target is in an equivalence 
class by itself, and therefore all other states must be transient ones. Consequently, 
the learner must eventually end up at the target state (the only recurrent state) with 
probability 1. ∎ 

It is also of interest to be able to compute the set of initial states from which the 
TLA learner is guaranteed to converge to the target state. The following corollary 
describes these states. 

Corollary 4.4.1 Given a Markov chain C corresponding to a GW TLA learner, the 
set of learnable initial states is exactly the set of states that are connected to the target 
and unconnected to the nontarget closed states of the Markov chain. 
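
The criterion in the theorem and corollary can be sketched as a reachability computation on the arcs of the chain. The example below is a minimal illustration on a hypothetical 4-state chain (it is not the GW chain), and the function names are assumptions made for this sketch. A state counts as a learnable initial state when every state reachable from it can itself reach the target, which is one way of reading "connected to the target and unconnected to the nontarget closed states".

```python
from fractions import Fraction as F

def reachable(T, start):
    """Return the set of states reachable from `start` along nonzero arcs."""
    seen, stack = {start}, [start]
    while stack:
        i = stack.pop()
        for j, p in enumerate(T[i]):
            if p > 0 and j not in seen:
                seen.add(j)
                stack.append(j)
    return seen

def absorbing_states(T):
    """States whose only arc is a self-loop of probability 1."""
    return [i for i, row in enumerate(T) if row[i] == 1]

def learnable_initial_states(T, target):
    """States from which the target is reached with probability 1."""
    n = len(T)
    can_reach_target = {i for i in range(n) if target in reachable(T, i)}
    # A start state is safe only if it can never wander into a state
    # from which the target is no longer reachable.
    return sorted(i for i in range(n) if reachable(T, i) <= can_reach_target)

# Hypothetical chain: state 0 is the target, state 3 is a second absorbing state.
T = [
    [F(1), F(0), F(0), F(0)],
    [F(1, 2), F(1, 2), F(0), F(0)],
    [F(0), F(1, 4), F(1, 2), F(1, 4)],
    [F(0), F(0), F(0), F(1)],
]
print(absorbing_states(T))
print(learnable_initial_states(T, target=0))
```

In this toy chain state 2 is connected to the target, yet it is not a learnable initial state because it can also fall into the absorbing state 3.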



25 Any memoryless algorithm operating on this finite parameter space can be modeled as a 
first-order Markov chain. See appendix B of this chapter. The theorem is true for all such 
algorithms, not just the TLA. 

It is easy to see from inspection of the figure that there are exactly 2 absorbing 
states in this Markov chain, that is, states that have no exit arcs. One AS is the 
target grammar (by definition). The other AS is state 2. Correspondingly, by our 
theorem above, the target is not learnable by the TLA. This is correctly noted by 
Gibson and Wexler. In an attempt to obtain a list of initial states from which the 
learner is unable to reach the target, Gibson and Wexler list only states 2 and 4. 
State 2, as we have seen, is an additional AS; clearly the learner will not reach the 
target from here. State 4 is unconnected to the target by any path in the chain, so 
the learner cannot reach the target from here either. They compute the 
list of problematic initial states as those from which the learner can never reach the 
target, in other words, those states which are unconnected to the target. They have 
implicitly assumed that if a triggered path to the target exists, it will be taken with 
probability one. This need not be the case. We will soon see that there are additional 
problematic states, from which the learner cannot reach the target with probability 
one. Gibson and Wexler omit these states in their analysis. 

4.5 Derivation of Transition Probabilities for the 
Markov TLA Structure 

We have argued in the previous section that the TLA working on finite parameter 
spaces reduces to a Markov chain. This argument cannot be complete without the 
precise computation of the transition probabilities from state to state. We do this 
now. 

Consider a parametric family with n boolean-valued parameters. These define 
$2^n$ grammars (and by extension, languages), as we have discussed. Let the target 
language $L_t$ consist of the strings (sentences) $s_1, s_2, \ldots$, i.e., 

$$L_t = \{s_1, s_2, s_3, \ldots\} \subseteq \Sigma^*$$

Let there be a probability distribution P on these strings (see footnote 26), according to 
which they are drawn and presented to the learner. Suppose the learner is in a state s 
corresponding to the language $L_s$. Consider some other state k corresponding to the 
language $L_k$. What is the probability that the TLA will update its hypothesis from 
$L_s$ to $L_k$ after receiving the next example sentence? First, observe that due to the 
single valued constraint, if k and s differ by more than one parameter setting, then 
the probability of this transition is zero. As a matter of fact, the TLA will move 
from s to k only if the following two conditions are met: (1) the next sentence it 
receives (say $\omega$, which occurs with probability $P(\omega)$) is analyzable by the parameter 
settings corresponding to k and not by the parameter setting corresponding to s, and 
(2) the TLA has a choice of n parameters to flip on not being able to analyze $\omega$ and it 
picks the one which would move it to state k. 

26 This is equivalent to assuming a noise-free situation, in the sense that no sentence outside 
of the target language can occur. However, one could choose malicious distributions so that not 
all strings from the target are presented to the learner. If one wishes to include noise, one need 
only consider a distribution P on $\Sigma^*$ rather than on the strings of $L_t$. Everything else in the 
derivation remains identical. This would yield a Markov chain corresponding to the TLA operating 
in the presence of noise. We study this situation in greater detail in the next chapter. 

Event 1 occurs with probability $\sum_{\omega \in L_k \setminus L_s} P(\omega)$, while event 2 occurs with 
probability 1/n since the parameter to flip is chosen uniformly at random out of the n 
possible choices. Thus the co-occurrence of both these events yields the following 
expression for the total probability of transition from s to k after one step: 

$$P[s \to k] = \sum_{s_j \in L_t,\; s_j \notin L_s,\; s_j \in L_k} (1/n)\, P(s_j)$$

Since the total probability over all the arcs out of s (including the self loop) must be 
1, we obtain the probability of remaining in state s after one step as 

$$P[s \to s] = 1 - \sum_{k \text{ is a neighboring state of } s} P[s \to k]$$

Finally, given any parameter space with n parameters, we have $2^n$ languages. 
Fixing one of them as the target language $L_t$, we obtain the following procedure for 
constructing the corresponding Markov chain. Note that this will yield a Markov 
chain with the same topology (in the absence of noise) as the GW procedure in their 
paper. However, there is the significant difference of adding a probability measure on 
the language family. 

• (Assign distribution) First fix a probability measure P on the strings of the 
target language $L_t$. 

• (Enumerate states) Assign a state to each language, i.e., each $L_i$. 

• (Take set differences.) Now for any two states i and k, if they are more than 
1 Hamming distance apart, then the transition probability $P[i \to k] = 0$. If they are 1 
Hamming distance apart, then $P[i \to k] = \frac{1}{n} P(L_k \setminus L_i)$. 

This model captures the dynamics of the TLA completely. We have indicated, 
in a previous footnote, how to extend the model to cover noise. In general, a class 
of memoryless algorithms can be modeled by a Markov chain. Appendix B of this 
chapter shows how to do this. 
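
The three-step construction procedure above can be sketched in a few lines. The example is a minimal illustration under stated assumptions: languages are given as finite sets of sentences keyed by their parameter tuple, and the 2-parameter family `LANGUAGES` with sentences "a", "b", "c" is purely hypothetical (it is not the GW data of Appendix A). The function name `tla_transition_matrix` is likewise an assumption of this sketch.

```python
from fractions import Fraction

def tla_transition_matrix(languages, target, P):
    """Build the TLA Markov chain for a finite parameter space.

    languages: dict mapping a parameter tuple to its set of sentences
    target:    parameter tuple of the target grammar
    P:         dict mapping each target sentence to its probability
    """
    states = sorted(languages)
    n = len(target)                       # number of parameters
    index = {s: i for i, s in enumerate(states)}
    T = [[Fraction(0)] * len(states) for _ in states]
    for s in states:
        row = T[index[s]]
        for bit in range(n):              # Single Value: flip one parameter
            k = s[:bit] + (1 - s[bit],) + s[bit + 1:]
            # Greediness: only target sentences analyzable in k but not in s
            mass = sum(P.get(w, 0) for w in languages[target]
                       if w in languages[k] and w not in languages[s])
            row[index[k]] = Fraction(1, n) * mass
        row[index[s]] = 1 - sum(row)      # self-loop takes the remaining mass
    return states, T

# Hypothetical 2-parameter family; (0, 0) is the target.
LANGUAGES = {
    (0, 0): {"a", "b"},
    (0, 1): {"a", "c"},
    (1, 0): {"b"},
    (1, 1): {"c"},
}
P = {"a": Fraction(1, 2), "b": Fraction(1, 2)}  # uniform on the target
STATES, T = tla_transition_matrix(LANGUAGES, (0, 0), P)
for state, row in zip(STATES, T):
    print(state, [str(p) for p in row])
```

Exact fractions are used so that each row can be checked to sum to one; the target row comes out as a pure self-loop, reflecting the fact that the target is always an absorbing state.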

Example (continued). 

Consider again the 3-parameter system in the previous figure with target language 
5 (spec-first, comp-final, -V2; English). We can calculate the set differences between 
the languages (this is easily done for unembedded sentences using the data from 
Appendix A). Thereafter, assuming a distribution on the sentences of the target 
(uniform on degree-0 sentences), one could simply follow the procedure prescribed 
above, and obtain the transition probabilities which annotate the Markov chain of 
fig. 4-37. 

For example, since the set difference between states 1 and 5 gives all of the target 
language, there is a (high) transition probability from state 1 to state 5. Similarly, 
since states 7 and 8 share some target language strings in common, such as S V, and 
do not share others, such as Adv S V and S V O, the learner can move from state 7 to 
8 and back again. 

Many additional properties of the triggering learning system now become evident 
once the mathematical formalization has been given. It is easy to imagine other 
alternatives to the TLA that will avoid the local maxima problem. For example, as it 
stands, the learner only changes a parameter setting if that change allows the learner 
to analyze the sentence it could not analyze before. If we relax this condition so that 
in this situation the learner picks a parameter at random to change, then the problem 
with local maxima disappears, because there can be only 1 Absorbing State, namely 
the target grammar. All other states have exit arcs. Thus, by our main theorem, 
such a system is learnable. 

Or consider, for example, the possibility of noise, that is, occasionally the learner 
gets strings that are not in the target language. GW state (fn. 4, p. 5) that this is not 
a problem: the learner need only pay attention to frequent data. But how is the learner to 
"pay attention" to frequent data? Unless some kind of memory or frequency-counting 
device is added, the learner cannot know whether the example it receives is noise or 
not. If this is the case, then there is always some finite probability, however small, of 
escaping a local maximum. It appears that the identification in the limit framework 
as given is simply incompatible with the notion of noise, unless a memory window of 
some kind is added. 

We may now proceed to ask the following questions about the TLA more precisely: 

1. Does it converge? 

2. How fast does it converge? How does this vary with distributional assumptions 
on the input examples? 

3. Since our derivation is general, we can now compute the dynamics for other 
"natural" parameter systems, like the 10-parameter system for the acquisition 
of stress in languages developed by Dresher and Kaye (1990). What results do 
they yield? 

4. Variants of TLA would correspond to other Markov structures. Do they converge? 
If so, how fast? 

5. How does the convergence time scale up with the number of parameters? 

6. What is the computational complexity of learning parameterized language families? 

7. What happens if we move from on-line to batch learning? Can we get PAC-style 
bounds (Valiant, 1984)? 

8. What does it mean to have non-stationary (nonergodic) Markov structures? 
How does this relate to assumptions about parameter ordering and maturation? 

To explore these and other possible variations systematically, let us return to the 
5-way classification scheme for learning models introduced at the beginning of this 
chapter. Recall that we have chosen a particular point in the 5-dimensional space 
for preliminary analysis. This, among other things, corresponds to an assumption of 
no noise, and a uniform probability distribution on the unembedded sentences of the 
target. We have shown how to model this particular learning problem by a Markov 
chain. This allows us to characterize learnability by our theorem earlier. 

In the next section, we discuss how to characterize the sample complexity of 
a learning system modeled as a Markov chain. Our eventual goal, however, is to 
explore more completely the space of fig. 4-36. We consider variations to our first 
learning problem along several dimensions. In particular, we discuss in turn the 
effect on learnability and sample complexity of distributional assumptions on the 
data (question 2 above), and some variations in the learning algorithm (question 4). 
In the next chapter, we will consider the effect of noise, and how that can potentially 
bring about diachronic syntax change, as well as some alternate parameterizations 
(question 3). 

4.6 Characterizing Convergence Times for the Markov 
Chain Model 

The Markov chain formulation gives us some distinct advantages in theoretically 
characterizing the language acquisition problem. First, we have already seen how, given 
a Markov chain, one could investigate whether or not it has exactly one absorbing 
state corresponding to the target grammar. This is equivalent to the question of 
whether any local maxima exist. One could also look at other issues (like stationarity 
or ergodicity assumptions) that might potentially affect convergence. Later we 
will consider several variants to TLA and analyze them formally within the Markov 
framework. We will also see that these variants do not suffer from the local maxima 
problem associated with GW's TLA. 

Perhaps the most significant advantage of the Markov chain formulation is that it allows 
us to also analyze convergence times. Given the transition matrix of a Markov chain, 
the problem of how long it takes to converge has been well studied. This question is of 
crucial importance in learnability. Following GW, we believe that it is not enough to 
show that the learning problem is consistent, i.e., that the learner will converge to the 
target in the limit. We also need to show, as GW point out, that the learning problem 
is feasible, i.e., the learner will converge in "reasonable" time. This is particularly 
true in the case of finite parameter spaces where consistency might not be as much of 
a problem as feasibility. The Markov formulation allows us to attack the feasibility 
question. It also allows us to clarify the assumptions about the behavior of data and 
learner inherent in such an attack. We begin by considering a few ways in which one 
could formulate the question of convergence times. 

4.6.1 Some Transition Matrices and Their Convergence Curves 

Let us begin by following the procedure detailed in the previous section to actually 
obtain a few transition matrices. Consider the example which we looked at informally 
in the previous section. Here the target grammar was grammar 5 (according 
to our numbering of the languages in Appendix A). For simplicity, let us first assume 
a uniform distribution on the degree-0 strings in $L_5$, i.e., the probability the learner 
sees a particular string $s_j$ in $L_5$ is 1/12 because there are 12 (degree-0) strings in $L_5$. 
We can now compute the transition matrix as the following, where 0's occupy matrix 
entries if not otherwise specified: 

Notice that both 2 and 5 correspond to absorbing states; thus this chain suffers 
from the local maxima problem. Note also (following the previous figure as well) that 
state 4 only exits to either itself or to state 2, hence is also a local maximum. For a 
given transition matrix T, it is possible to compute 

$$T_\infty = \lim_{m \to \infty} T^m$$

If T is the transition probability matrix of a chain, then $t_{ij}$, i.e., the element of T 
in the ith row and jth column, is the probability that the learner moves from state i 
to state j in one step. It is a well-known fact that if one considers the corresponding 
i,j element of $T^m$ then this is the probability that the learner moves from state i to 
state j in m steps. Correspondingly, the i,jth element of $T_\infty$ is the probability of 
going from initial state i to state j "in the limit" as the number of examples goes to 
infinity. For learnability to hold irrespective of which state the learner starts in, the 
probability that the learner reaches state 5 should tend to 1 as m goes to infinity. 
This means that column 5 of $T_\infty$ should consist of 1's, and the matrix should contain 
0's everywhere else. Actually we find that $T^m$ converges to the following matrix as 
m goes to infinity: 

Examining this matrix we see that if the learner starts out in states 2 or 4, it will 
certainly end up in state 2 in the limit. These two states correspond to local maxima 
grammars in the GW framework. If the learner starts in either of these two states, it 
will never reach the target. From the matrix we also see that if the learner starts in 
states 5 through 8, it will certainly converge in the limit to the target grammar. 

The situation regarding states 1 and 3 is more interesting, and not covered in 
Gibson and Wexler (1994). If the learner starts in either of these states, it will reach 
the target grammar with probability 2/3 and reach state 2, the other absorbing state 
with probability 1/3. Thus we see that local maxima (states unconnected to the 
target) are not the only problem for learnability. As a consequence of our stochastic 
formulation, we see that there are initial hypotheses from which triggered paths exist 
to the target; however, the learner will not take these paths with probability one. In 
our case, because of the uniform distribution assumption, we see that the path to 
the target will only be taken with probability 2/3. By making the distribution more 
favorable, this probability can be made larger, but it can never be made one. 
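
To illustrate how a state can be connected to the target and still fail to reach it with probability one, the following sketch raises a transition matrix to a high power and reads off the limiting probabilities. The 4-state chain is hypothetical (it is not the GW chain for target 5, whose entries are not reproduced here); its numbers were merely chosen so that the split between the two absorbing states is 2/3 versus 1/3. The helper names are assumptions of this sketch.

```python
def matmul(A, B):
    """Multiply two square matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def matrix_power(T, m):
    """Compute T**m by repeated squaring."""
    n = len(T)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = [row[:] for row in T]
    while m > 0:
        if m & 1:
            result = matmul(result, base)
        base = matmul(base, base)
        m >>= 1
    return result

# Hypothetical chain: state 0 = target, state 1 = a second absorbing state.
T = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [1 / 3, 1 / 6, 1 / 2, 0.0],   # can step to the target or to the trap
    [0.0, 0.0, 1 / 2, 1 / 2],     # can only reach the target through state 2
]
LIMIT = matrix_power(T, 200)      # numerically indistinguishable from the limit
for row in LIMIT:
    print([round(p, 6) for p in row])
```

Rows 2 and 3 of the limiting matrix place only part of their mass on the target column, even though both states have a path to the target; this is the situation that a purely connectivity-based analysis overlooks.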

This analysis, motivated as it was by our information-theoretic perspective, considerably 
increases the number of problematic initial states from that presented in 
Gibson and Wexler. While the broader implications of this are not clear, it certainly 
renders moot some of the linguistic implications of GW's analysis (see footnote 27). 

Obviously one can examine other details of this particular system. However, let 
us now look at a case where there is no local maxima problem. This is the case when 
the target languages have verb-second (V2) movement in GW's 3-parameter case. 



27 For example, GW rely on "connectedness" to obtain their list of local maxima. From this 
(incorrect) list, noticing that all local maxima were +Verb Second (+V2), they argued for ordered 
parameter acquisition or "maturation". In other words, they claimed that the V2 parameter was 
more crucial, and had to be set earlier in the child's language acquisition process. Our analysis shows 
that this is incorrect, an example of how computational analysis can aid the search for adequate 
linguistic theories. 

Consider the transition matrix (shown below) obtained when the target language is 
$L_1$. Again we assume a uniform distribution on strings of the target. 

Here we find that $T^m$ does indeed converge to a matrix with 1's in the first column 
and 0's elsewhere. Consider the first column of $T^m$. It is of the form: 

$$(p_1(m), p_2(m), p_3(m), p_4(m), p_5(m), p_6(m), p_7(m), p_8(m))'$$

Here $p_i(m)$ denotes the probability of being in state 1 at the end of m examples 
in the case where the learner started in state i. Naturally we want 

$$\lim_{m \to \infty} p_i(m) = 1$$

and for this example this is indeed the case. Fig. 4-38 shows a plot of the following 
quantity as a function of m, the number of examples. 

$$p(m) = \min_i \{p_i(m)\}$$

The quantity p(m) is easy to interpret. Thus p(m) = 0.95 means that for every 
initial state of the learner the probability that it is in the target state after m examples 
is at least 0.95. Further there is one initial state (the worst initial state with 
respect to the target, which in our example is $L_8$) for which this probability is exactly 
0.95. We find on looking at the curve that the learner converges with high probability 
within 100 to 200 (degree-0) example sentences, a psychologically plausible number. 
(One can now of course proceed to examine actual transcripts of child input to calculate 
convergence times for "actual" distributions of examples, and we are currently 
engaged in this effort.) 
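
The worst-case convergence curve p(m) can be computed without forming full matrix powers, since the target column of $T^m$ is obtained by multiplying the target column of $T^{m-1}$ by T. The sketch below does this for a hypothetical 4-state chain with a single absorbing target (state 0); the matrix is illustrative only and is not the GW matrix for $L_1$, and the function names are assumptions.

```python
def convergence_curve(T, target, max_m):
    """Return [p(1), ..., p(max_m)], where p(m) is the minimum over initial
    states of the probability of being in `target` after m examples."""
    n = len(T)
    column = [1.0 if i == target else 0.0 for i in range(n)]  # column of T**0
    curve = []
    for _ in range(max_m):
        # Target column of T**m equals T times the target column of T**(m-1).
        column = [sum(T[i][j] * column[j] for j in range(n)) for i in range(n)]
        curve.append(min(column))
    return curve

def examples_needed(curve, level):
    """Smallest m with p(m) >= level, or None if never reached."""
    for m, p in enumerate(curve, start=1):
        if p >= level:
            return m
    return None

# Hypothetical chain with exactly one absorbing state (the target, state 0).
T = [
    [1.00, 0.00, 0.00, 0.00],
    [0.25, 0.75, 0.00, 0.00],
    [0.25, 0.00, 0.75, 0.00],
    [0.00, 0.25, 0.25, 0.50],
]
CURVE = convergence_curve(T, target=0, max_m=50)
M95 = examples_needed(CURVE, 0.95)
print(M95, CURVE[M95 - 1])
```

Because the target is absorbing, each $p_i(m)$ can only grow with m, so the curve is nondecreasing and the first crossing of a confidence level such as 0.95 is well defined.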

Now that we have made a first attempt to quantify the convergence time, several 
other questions can be raised. How does convergence time depend upon the distribution 
of the data? How does it compare with other kinds of Markov structures with 
the same number of states? How will the convergence time be affected if the number 
of states increases, i.e., the number of parameters increases? How does it depend upon 
the way in which the parameters relate to the surface strings? Are there other ways to 
characterize convergence times? We now proceed to answer some of these questions. 

4.6.2 Absorption Times 

In the previous section, we computed the transition matrix for a fixed (in principle, 
this could be arbitrary) distribution and showed the rate of convergence in a certain 
way. In particular, we plotted p(m) (the probability of converging from the most 
unfavorable initial state) against m (the number of samples). However, this is not 
the only way to characterize convergence times. Given an initial state, the time taken 
to reach the absorbing state (known as the absorption time) is a random variable. 
One can compute the mean and variance of this random variable. For the case when 
the target language is $L_1$, we have seen that the transition matrix has the form: 

$$T = \begin{pmatrix} 1 & \mathbf{0} \\ R & Q \end{pmatrix}$$

Here Q is a 7-dimensional square matrix (and R collects the one-step probabilities of 
moving from states 2 through 8 into the target). The mean absorption times from states 2 
through 8 are given by the vector (see Isaacson and Madsen (1976)) 

$$\mu = (I - Q)^{-1} \mathbf{1}$$

where $\mathbf{1}$ is a 7-dimensional column vector of ones. The vector of second moments is 
given by 

$$\mu' = (I - Q)^{-1} (2\mu - \mathbf{1}).$$

Using this result, we can now compute the mean and standard deviation of the absorption 
time from the most unfavorable initial state of the learner. (We note that 
the distribution of absorption times is fairly skewed in such cases and so is not symmetric 
about the mean, as may be seen from the previous curves.) The four learning scenarios 
considered are the TLA with uniform, and increasingly malicious distributions (discussed 
later), and the random walk (also discussed later). 

Learning scenario      Mean abs. time         St. Dev. of abs. time 
TLA (uniform)          34.8                   22.3 
TLA (a = 0.99)         45000                  33000 
TLA (a = 0.9999)       $4.5 \times 10^6$      $3.3 \times 10^6$ 
RW                     9.6                    10.1 


4.6.3 Eigenvalue Rates of Convergence 

In classical Markov chain theory, there are also well-known convergence theorems 
derived from a consideration of the eigenvalues of the transition matrix. We state 
without proof a convergence result for transition matrices in terms of their eigenvalues. 

Theorem 4.6.1 Let T be an $n \times n$ transition matrix with n linearly independent 
left eigenvectors $x_1, \ldots, x_n$ corresponding to eigenvalues $\lambda_1, \ldots, \lambda_n$. Let x (an 
n-dimensional vector) represent the starting probability of being in each state of the 
chain and $\pi$ be the limiting probability of being in each state. Then after k transitions, 
the probability of being in each state, $x T^k$, can be described by 

$$\| x T^k - \pi \| = \Big\| \sum_{i=2}^{n} \lambda_i^k \, x y_i x_i \Big\| \le \max_{2 \le i \le n} |\lambda_i|^k \sum_{i=2}^{n} \| x y_i x_i \|$$

where the $y_i$'s are the right eigenvectors of T. 

This theorem thus bounds the rate of convergence to the limiting distribution $\pi$ 
(in cases where there is only one absorption state, $\pi$ will have a 1 corresponding to 
that state and 0 everywhere else). Using this result we can now bound the rates of 
convergence (in terms of the number, k, of samples) by: 



Learning scenario      Rate of convergence 
TLA (uniform)          $O(0.94^k)$ 
TLA (a = 0.99)         $O((1 - 10^{-4})^k)$ 
TLA (a = 0.9999)       $O((1 - 10^{-6})^k)$ 
RW                     $O(0.89^k)$ 
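
When the target is the only absorbing state, the eigenvalues of T are 1 together with the eigenvalues of the transient block Q, so the geometric rate in a bound of this kind is the largest eigenvalue modulus of Q. As a rough illustration, the sketch below estimates that quantity by power iteration. Power iteration is a technique swapped in here for simplicity (it is not the method used to produce the table), it assumes Q is nonnegative with a dominant eigenvalue that is not cancelled by another of equal modulus, and the 3-state Q is hypothetical.

```python
def dominant_rate(Q, iterations=200):
    """Estimate the largest eigenvalue modulus of a nonnegative matrix Q
    by power iteration started from the all-ones vector."""
    n = len(Q)
    v = [1.0 / n] * n                      # normalized so the entries sum to 1
    rate = 0.0
    for _ in range(iterations):
        w = [sum(Q[i][j] * v[j] for j in range(n)) for i in range(n)]
        total = sum(w)
        if total == 0.0:                   # Q is nilpotent: absorbed for sure
            return 0.0
        rate = total                       # growth factor, since sum(v) == 1
        v = [x / total for x in w]         # renormalize to avoid underflow
    return rate

# Hypothetical transient block of a chain with one absorbing target state.
Q = [
    [0.75, 0.00, 0.00],
    [0.00, 0.75, 0.00],
    [0.25, 0.25, 0.50],
]
RATE = dominant_rate(Q)
print(RATE)   # the chain then converges at a rate of order RATE**k
```

For this Q the eigenvalues are 0.75, 0.75 and 0.5, so the estimate settles at 0.75 and the probability of not yet being at the target decays like $0.75^k$.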



This theorem also helps us to see the connection between the number of examples 
and the number of parameters since a chain with n states (corresponding to an $n \times n$ 
transition matrix) represents a language family with $\log_2(n)$ parameters. 

4.7 Exploring Other Points 

We have developed, by now, a complete set of tools to characterize learnability and 
sample complexity of memoryless algorithms working on finite parameter spaces. We 
applied these tools to a specific learning problem which corresponded to a point in our 
5-dimensional space previously investigated by Gibson and Wexler. We also provided 
an account of how our new analysis revised some of their conclusions and had possible 
applications to linguistic theory. Here we now explore some other points in the space. 
In the next section, we consider varying the learning algorithm, while keeping other 
assumptions about the learning problem identical to that before. Later, we vary the 
distribution of the data. 

4.7.1 Changing the Algorithm 

As one example of the power of this approach, we can compare the convergence time of 
TLA to other algorithms. TLA observes the single value and greediness constraints. 
We consider the following three simple variants by dropping either or both of the 
Single Value and Greediness constraints: 

Random walk with neither greediness nor single value constraints: We 
have already seen this example before. The learner is in a particular state. Upon 
receiving a new sentence, it remains in that state if the sentence is analyzable. If not, 
the learner moves uniformly at random to any of the other states and stays there 
waiting for the next sentence. This is done without regard to whether the new state 
allows the sentence to be analyzed. 

Random walk with no greediness but with single value constraint: The 
learner remains in its original state if the new sentence is analyzable. Otherwise, 
the learner chooses one of the parameters uniformly at random and flips it, thereby 
moving to an adjacent state in the Markov structure. Again this is done without 
regard to whether the new state allows the sentence to be analyzed. However, since 
only one parameter is changed at a time, the learner can only move to neighboring 
states at any given time. 

Random walk with no single value constraint but with greediness: The 
learner remains in its original state if the new sentence is analyzable. Otherwise the 
learner moves uniformly at random to any of the other states and stays there iff the 

yet another example of how the computational perspective can allow us to rethink 
cognitive assumptions. Of course, it may be that the TLA has empirical support, in 
the sense of independent evidence that children do use this procedure (given by the 
pattern of their errors, etc.), but this evidence is lacking, as far as we know. 
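
The two non-greedy random-walk variants described in Section 4.7.1 lead to particularly simple transition matrices: the learner leaves state s with the probability that the next sentence is not analyzable in $L_s$, and spreads that mass uniformly over the allowed destinations. The sketch below illustrates this on a hypothetical 2-parameter family with sentences "a", "b", "c" (not the GW data); the function names and the `single_value` flag are assumptions of this sketch.

```python
from fractions import Fraction

def miss_probability(language, P):
    """Probability that a sentence drawn from P is not analyzable."""
    return sum((p for w, p in P.items() if w not in language), Fraction(0))

def random_walk_matrix(languages, P, single_value):
    """Transition matrix of a non-greedy random-walk learner.

    single_value=False: jump uniformly to any other state.
    single_value=True:  flip one parameter chosen uniformly at random.
    """
    states = sorted(languages)
    T = []
    for s in states:
        q = miss_probability(languages[s], P)
        if single_value:
            moves = [k for k in states
                     if sum(a != b for a, b in zip(s, k)) == 1]
        else:
            moves = [k for k in states if k != s]
        row = [q / len(moves) if k in moves else Fraction(0) for k in states]
        row[states.index(s)] = 1 - q       # stay put when the sentence parses
        T.append(row)
    return states, T

# Hypothetical 2-parameter family; (0, 0) is the target.
LANGUAGES = {
    (0, 0): {"a", "b"},
    (0, 1): {"a", "c"},
    (1, 0): {"b"},
    (1, 1): {"c"},
}
P = {"a": Fraction(1, 2), "b": Fraction(1, 2)}
STATES, T_FREE = random_walk_matrix(LANGUAGES, P, single_value=False)
_, T_SINGLE = random_walk_matrix(LANGUAGES, P, single_value=True)
for row in T_FREE:
    print([str(p) for p in row])
```

In both variants every non-target state that fails to analyze some target sentence has an exit arc, so in this toy family the target is the only absorbing state.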

4.7.2 Distributional Assumptions 

In an earlier section we assumed that the data was uniformly distributed. We computed 
the transition matrix for a particular target language and showed that convergence 
times were of the order of 100-200 samples. In this section we show that 
the convergence times depend crucially upon the distribution. In particular we can 
choose a distribution that will make the convergence time as large as we want. Thus 
the distribution-free convergence time for the 3-parameter system is infinite. 

As before, we consider the situation where the target language is $L_1$. There are
no local maxima problems for this choice. We begin by letting the distribution be
parameterized by the variables $a, b, c, d$, where

$$a = P(A = \{\text{Adv V S}\})$$

$$b = P(B = \{\text{Adv V O S},\ \text{Adv Aux V S}\})$$

$$c = P(C = \{\text{Adv V O1 O2 S},\ \text{Adv Aux V S},\ \text{Adv Aux V O1 O2 S}\})$$

$$d = P(D = \{\text{V S}\})$$

Thus each of the sets $A$, $B$, $C$ and $D$ contains different degree-0 sentences of $L_1$. Clearly
the probability of the set $L_1 \setminus (A \cup B \cup C \cup D)$ is $1 - (a + b + c + d)$. The elements of each
defined subset of $L_1$ are equally likely with respect to each other. Setting positive
values for $a, b, c, d$ such that $a + b + c + d < 1$ now defines a unique probability for
each degree-0 sentence in $L_1$. For example, the probability of (Adv V O S) is $b/2$,
the probability of (Adv Aux V S) is $c/3$, that of (V S) is $(1 - (a + b + c + d))/6$
and so on.

We can now obtain the transition matrix corresponding to this distribution. This 
is shown in Table 4.2. 

Table 4.2: Transition matrix corresponding to a parameterized choice for the
distribution on the target strings. In this case the target is $L_1$ and the distribution is
parameterized according to Section 4.7.2.

Compare this matrix with that obtained with a uniform distribution on the
sentences of $L_1$ in the earlier section. This matrix has non-zero elements (transition
probabilities) exactly where the earlier matrix had non-zero elements. However, the
value of each transition probability now depends upon $a$, $b$, $c$, and $d$. In particular, if
we choose $a = 1/12$, $b = 2/12$, $c = 3/12$, $d = 1/12$ (this is equivalent to assuming a
uniform distribution) we obtain the appropriate transition matrix as before. Looking
more closely at the general transition matrix, we see that the transition probability
from state 2 to state 1 is $(1 - (a + b + c))/3$. Clearly if we make $a$ arbitrarily close
to 1, then this transition probability is arbitrarily close to 0, so that the number of
samples needed to converge can be made arbitrarily large. Thus choosing large values
for $a$ and small values for $b$ will result in large convergence times.

This means that the sample complexity cannot be bounded in a distribution-free
sense, because by choosing a highly unfavorable distribution the sample complexity
can be made arbitrarily high. For example, we now give the convergence curves
calculated for different choices of $a, b, c, d$. We see that for a uniform distribution the
convergence occurs within 200 samples. By choosing a distribution with $a = 0.9999$
and $b = c = d = 0.000001$, the convergence time can be pushed up to as much as
50 million samples. (Of course, this distribution is presumably not psychologically
realistic.) For $a = 0.99$, $b = c = d = 0.0001$, the sample complexity is on the order of
100,000 positive examples.
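
The effect of a single small transition probability can be illustrated with a short calculation. The sketch below evaluates the state-2-to-state-1 probability (1 - (a + b + c))/3 quoted above and the mean of the corresponding geometric waiting time, 1/p. The function names are hypothetical, and the numbers describe only that one transition in isolation (assuming no other transition intervenes); they are not the full-chain convergence times reported in the text.

```python
def p_state2_to_state1(a: float, b: float, c: float) -> float:
    """Transition probability from state 2 to state 1 quoted in the text."""
    return (1.0 - (a + b + c)) / 3.0


def expected_wait(p: float) -> float:
    """Mean of a geometric waiting time with per-sample success probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    return 1.0 / p


if __name__ == "__main__":
    # (a, b, c) settings taken from the text; d does not enter this transition.
    settings = {
        "uniform": (1 / 12, 2 / 12, 3 / 12),
        "a = 0.99": (0.99, 0.0001, 0.0001),
        "a = 0.9999": (0.9999, 0.000001, 0.000001),
    }
    for name, (a, b, c) in settings.items():
        p = p_state2_to_state1(a, b, c)
        print(f"{name:>10}: p = {p:.3e}, expected wait = {expected_wait(p):,.0f} samples")
```

As a grows toward 1 the waiting time for this transition grows without bound, which is the mechanism behind the distribution-dependence argued above.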

4.7.3 Natural Distributions: CHILDES Corpus

It is of interest to examine the fidelity of the model using real language distributions,
namely, the CHILDES database. We have carried out preliminary direct experiments
using the CHILDES caretaker English input to "Nina" and German input to "Katrin";
these consist of 43,612 and 632 sentences, respectively. We note, following
well-known results by psycholinguists, that both corpora contain a much higher
percentage of aux-inversion and wh-questions than "ordinary" text (e.g., the LOB):
25,890 questions, and 11,775 wh-questions; 201 and 99 in the German corpus; but
only 2,506 questions or 3.7% out of 53,495 LOB sentences.

To test convergence, an implemented system using a newer version of deMarcken's
partial parser (see deMarcken, 1990) analyzed each degree-0 or degree-1 sentence as
falling into one of the input patterns SVO, S Aux V, etc., as appropriate for the target 
language. Sentences not parsable into these patterns were discarded (presumably "too 
complex" in some sense following a tradition established by many other researchers; 
see Wexler and Culicover (1980) for details). Some examples of caretaker inputs 
follow: 

this is a book ? what do you see in the book ? 

how many rabbits ? 

what is the rabbit doing ? (...) 

is he hopping ? oh . and what is he playing with ? 

red mir doch nicht alles nach ! 

ja , die schwatzen auch immer alles nach (...) 

When these inputs are run through the TLA, we find that convergence falls roughly along
the TLA convergence curve displayed in figure 1: roughly 100 examples to asymptote.
Thus, the feasibility of the basic model is confirmed by actual caretaker input, at 
least in this simple case, for both English and German. We are continuing to explore 
this model with other languages and distributional assumptions. However, there is 
one very important new complication that must be taken into account: we have 
found that one must (obviously) add patterns to cover the predominance of auxiliary 
inversions and wh-questions. However, that largely begs the question of whether the 
language is verb-second or not. Thus, as far as we can tell, we have not yet arrived 
at a satisfactory parameter-setting account for V2 acquisition. 

4.8 Batch Learning Upper and Lower Bounds: An Aside

So far we have discussed a memoryless learner moving from state to state in parameter
space and hopefully converging to the correct target in finite time. As we saw, this was
well-modeled by our Markov formulation. In this section, however, we step back and
consider upper and lower bounds for learning finite language families if the learner were
allowed to remember all the strings encountered and optimize over them. Needless
to say, this might not be a psychologically plausible assumption, but it can shed light
on the information-theoretic complexity of the learning problem.

Consider a situation where there are $n$ languages $L_1, L_2, \ldots, L_n$ over an alphabet
$\Sigma$. Each language can be represented as a subset of $\Sigma^*$, i.e.

$$L_i = \{\omega_{i1}, \omega_{i2}, \ldots\}; \quad \omega_{ij} \in \Sigma^*$$

The learner is provided with positive data (strings that belong to the language)
drawn according to distribution $P$ on the strings of a particular target language. The
learner is to identify the target. It is quite possible that the learner receives strings
that are in more than one language. In such a case the learner will not be able to
uniquely identify the target. However, as more and more data becomes available, the
probability of having received only ambiguous strings becomes smaller and smaller
and eventually the learner will be able to identify the target uniquely. An interesting
question to ask then is how many samples the learner needs to see so that with
high confidence it is able to identify the target, i.e. the probability that after seeing
that many samples, the learner is still ambiguous about the target is less than $\delta$. The
following theorem provides a lower bound.

Theorem 4.8.1 The learner needs to draw at least $M = \max_{j \neq t} \frac{1}{\ln(1/p_j)} \ln(1/\delta)$ samples
(where $p_j = P(L_t \cap L_j)$) in order to be able to identify the target with confidence
greater than $1 - \delta$.

Proof. Suppose the learner draws $m$ (less than $M$) samples. Let $k = \arg\max_{j \neq t} p_j$.
This means 1) $M = \frac{1}{\ln(1/p_k)} \ln(1/\delta)$ and 2) that with probability $p_k$ the learner receives
a string which is in both $L_k$ and $L_t$. Hence it will be unable to discriminate between
the target and the $k$th language. After drawing $m$ samples, the probability that all of
them belong to the set $L_t \cap L_k$ is $(p_k)^m$. In such a case even after seeing $m$ samples,
the learner will be in an ambiguous state. Now $(p_k)^m > (p_k)^M$ since $m < M$ and
$p_k < 1$. Finally since $M \ln(1/p_k) = \ln((1/p_k)^M) = \ln(1/\delta)$, we see that $(p_k)^m > \delta$.
Thus the probability of being ambiguous after $m$ examples is greater than $\delta$, which
means that the confidence of being able to identify the target is less than $1 - \delta$. $\blacksquare$
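
The lower bound of Theorem 4.8.1 is straightforward to evaluate numerically. The following is a minimal sketch, assuming the overlap probabilities p_j = P(L_t ∩ L_j) for the competing languages are already known; the function name and the example values are hypothetical and chosen only for illustration.

```python
import math
from typing import Iterable


def lower_bound_samples(overlaps: Iterable[float], delta: float) -> float:
    """Evaluate M = max_j ln(1/delta) / ln(1/p_j) from Theorem 4.8.1.

    `overlaps` holds p_j = P(L_t ∩ L_j) for every competitor j != t.
    """
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    ps = list(overlaps)
    if any(p < 0.0 or p > 1.0 for p in ps):
        raise ValueError("each p_j must be a probability")
    if any(p == 1.0 for p in ps):
        # Every target string also lies in some competitor: never separable.
        return math.inf
    positive = [p for p in ps if p > 0.0]  # p_j = 0 contributes a bound of 0
    if not positive:
        return 0.0
    return max(math.log(1.0 / delta) / math.log(1.0 / p) for p in positive)


if __name__ == "__main__":
    # Hypothetical overlaps with two competing languages, 95% confidence.
    M = lower_bound_samples([0.5, 0.9], delta=0.05)
    print(f"at least {M:.2f} samples are needed")
```

The bound is driven entirely by the competitor with the largest overlap, mirroring the choice of k in the proof.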

This simple result allows us to assess the number of samples we need to draw in 
order to be confident of correctly identifying the target. Note that if the distribution 
of the data is very unfavorable, that is, the probability of receiving ambiguous strings 
is quite high, then the number of samples needed can actually be quite large. While 
the previous theorem provides the number of samples necessary to identify the target, 
the following theorem provides an upper bound for the number of samples that are 
sufficient to guarantee identification with high confidence. 

Theorem 4.8.2 If the learner draws more than $M = \frac{1}{\ln(1/(1 - b_t))} \ln(1/\delta)$ samples, then
it will identify the target with confidence greater than $1 - \delta$. (Here $b_t = P(L_t \setminus \bigcup_{j \neq t} L_j)$.)

Proof. Consider the set $L = L_t \setminus \bigcup_{j \neq t} L_j$. Any element of this set is present in
the target language $L_t$ but not in any other language. Consequently upon receiving
such a string, the learner will be able to instantly identify the target. After $m > M$
samples, the probability that the learner has not received any member of this set
is $(1 - P(L))^m = (1 - b_t)^m < (1 - b_t)^M = \delta$. Hence the probability of seeing some
member of $L$ in those $m$ samples is greater than $1 - \delta$. But seeing such a member
enables the learner to identify the target so the probability that the learner is able to
identify the target is greater than $1 - \delta$ if it draws more than $M$ samples. $\blacksquare$
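
The upper bound of Theorem 4.8.2 can be evaluated in the same way. This is a sketch under the assumption that b_t, the probability mass of strings belonging to the target alone, is given; the function name and the sample values are hypothetical.

```python
import math


def upper_bound_samples(b_t: float, delta: float) -> float:
    """Evaluate M = ln(1/delta) / ln(1/(1 - b_t)) from Theorem 4.8.2.

    b_t is the probability of drawing a string that lies in the target
    language and in no other language of the family.
    """
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    if not 0.0 <= b_t <= 1.0:
        raise ValueError("b_t must be a probability")
    if b_t == 0.0:
        return math.inf  # no uniquely identifying string is ever drawn
    if b_t == 1.0:
        return 0.0       # every string identifies the target
    return math.log(1.0 / delta) / math.log(1.0 / (1.0 - b_t))


if __name__ == "__main__":
    # Hypothetical values: a quarter of the mass is unique to the target.
    M = upper_bound_samples(b_t=0.25, delta=0.01)
    print(f"more than {M:.2f} samples suffice")
```

Because the bound scales like 1/b_t for small b_t, a distribution that rarely produces uniquely identifying strings forces the sample size up sharply.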

To summarize, this section provides a simple upper and lower bound on the sample
complexity of exact identification of the target language from positive data. The $\delta$
parameter that measures the confidence of the learner of being able to identify the
target is suggestive of a PAC [124] formulation. However there is a crucial difference.
In the PAC formulation, one is interested in an $\epsilon$-approximation to the target language
with at least $1 - \delta$ confidence. In our case, this is not so. Since we are not allowed to
approximate the target, the sample complexity shoots up with choice of unfavorable
distributions. There are some interesting directions one could follow within this batch
learning framework. One could try to get true PAC-style distribution-free bounds for 
various kinds of language families. Alternatively one could use the exact identification 
results here for linguistically plausible language families with "reasonable" probability 
distributions on the data. It might be an interesting exercise to recompute the bounds 
for cases where the learner receives both positive and negative data. Finally the 
bounds obtained here could be sharpened further. We intend to look into some of 
these questions in the future. 

4.9 Conclusions, Open Questions, and Future Directions

The problem of learning parameterized families of grammars has several different 
dimensions as we have emphasized earlier. One needs to investigate the learnability 
for a variety of algorithms, distributional assumptions, parameterizations, and so 
on. In this chapter, we have emphasized that it is not enough to merely check for 
learnability in the limit (as previous research within an inductive inference Gold 
framework has tended to do; see, for example, Osherson and Weinstein, 1986); one 
also needs to quantify the sample complexity of the learning problem, i.e., how many 
examples does the learning algorithm need to see in order to be able to identify 
the target grammar with high confidence. To illustrate the importance of this, we 
re-analyzed a particular learning problem previously studied by Gibson and Wexler.
Our reanalysis shows that on finite parameter spaces, the Triggering Learning
Algorithm in particular, and memoryless algorithms in general, can be completely 
modeled by a Markov process. This Markov model then allows us to check for learn- 
ability in a very simple fashion, rather than the more complicated procedures pre- 
viously used in the linguistics community. Further, it also allows us to characterize 
the sample complexity of learning with such algorithms. On studying the perfor- 
mance of the TLA on the specific 3-parameter subspace from this perspective, we 
found several new results. First, the existence of new problematic initial hypotheses 
was discovered — leading to revisions of certain aspects of maturation and parameter 
ordering suggested by Gibson and Wexler. Second, we showed that the existence 
of local triggers (in other words, a triggered path from the initial hypothesis to the 
target) is not sufficient to guarantee learnability. Third, we found that the TLA 
was suboptimal; for example the random walk algorithm on this space had no local 
maxima and converged faster. 

This analysis on a simple, previously studied, example demonstrates the useful- 
ness of our perspective. It should be reiterated that any finite parameterization, and 
a class of memoryless algorithms can be studied by this approach. There are several 
important questions which need to be pursued further. For example, one could turn 
to other natural parametric systems suggested (the example of metrical phonology 
given in this chapter, a variant studied by Dresher and Kaye (1990), a parameteriza- 
tion chosen by Clark and Roberts (1993)) and so on. One could then establish the 
complexity of learning these other parametric schemes, possibly with useful results 
again. 

Another crucial direction relates to the learning algorithm used. What happens 
when the learner is allowed the use of memory? An interesting investigation of this 
issue has been done by Kapur (1992). However some questions remain unresolved. 
For example, is it true that any algorithm with a finite memory size (n examples, 
say) can be modeled as a finite order Markov chain (presumably, the order would be 
related to n, in some sense)? Is this a useful way to characterize such algorithms? 
A complete characterization of human language requires us to describe the linguis- 
tic knowledge (equivalent to parameterization), and the algorithm children use to 
acquire this knowledge. Insights about the kinds of algorithms available, and their 
psychological feasibility, could often direct the search for the right kind of linguistic 
knowledge. 

It is also of interest to study the relationship between the expressive power of 
the parameterized family of grammars and the number of parameters. One needs 
to reiterate, here, the importance of our point of view in this thesis. Recall how
in Chapter 2, we investigated regularization networks from an approximation and
estimation point of view. Grammars are no different from regularization networks
in this sense. Thus, one could pose the following general problem. Assume a class of
grammars $\mathcal{G}$ as the concept class, and a parameterized class of grammars $\mathcal{H}_n$ as the
hypothesis class. Now, for a target grammar $g \in \mathcal{G}$, how many example sentences
need to be drawn, and how large must the number of parameters, n, be, so that the 
learner's hypothesis will be close to the target with high confidence? 

Yet another issue has to do with the "smoothness" relation between the parameter 
settings and the resulting surface strings. In principles-and-parameters theory, it has 
often been suggested that a small parameter change could lead to a large deductive 
change in the grammar, hence a large change in the surface language generated. In all 
the examples considered so far there is a smooth relation between surface sentences 
and parameters, in that switching from a V2 to a non-V2 system, for instance, leads 
us to a Markov state that is not too far away from the previous one. If this is 
not so, it is not so clear that the TLA will work as before. In fact, the whole 
question of how to formulate the notion of "smoothness" in a language-grammar 
framework is unclear. We know in the case of continuous functions, as discussed in 
Chapter 3, that if the learner is allowed to choose examples (which can be simulated 
by selective attention), then such an "active" learner can approximate such functions 
much more quickly than a "passive" learner, like the one presented in GW. Is there 
an analog to this in the discrete, digital domain of language? Further, how can one 
approximate a language? Here too mathematics may play a helpful role. Recall
that there is an analog to a functional analysis of languages, namely, the algebraic
approach advanced by Chomsky and Schützenberger (1963). In this model, a language
is described by an (infinite) polynomial generating function, where the coefficient on
the polynomial term $x$ gives the number of ways of deriving the string $x$. A (weak,
string) approximation to a language can then be defined in terms of an approximation
to the generating function. If this method can be deployed, then one might be able to 
carry over the results of functional analysis and approximation for active vs. passive 
learners into the "digital" domain of language. If this is possible, we would then 
have a very powerful set of previously underutilized mathematical tools to analyze 
language learnability.

Appendix

4-A Unembedded Sentences for Parametric Grammars

The following table provides the unembedded (degree-0) sentences from each of the 8
grammars (languages) obtained by setting the 3 parameters of example 1 to different
values. The languages are referred to as $L_1$ through $L_8$.



Language            Spec  Comp  V2   Degree-0 unembedded sentences

L_1                 1     1     0    "V S" "V O S" "V O1 O2 S" "Aux V S" "Aux V O S"
                                     "Aux V O1 O2 S" "Adv V S" "Adv V O S" "Adv V O1 O2 S"
                                     "Adv Aux V S" "Adv Aux V O S" "Adv Aux V O1 O2 S"

L_2                 1     1     1    "S V" "S V O" "O V S" "S V O1 O2"
                                     "O1 V O2 S" "O2 V O1 S" "S Aux V" "S Aux V O"
                                     "O Aux V S" "S Aux V O1 O2" "O1 Aux V O2 S" "O2 Aux V O1 S"
                                     "Adv V S" "Adv V O S" "Adv V O1 O2 S" "Adv Aux V S"
                                     "Adv Aux V O S" "Adv Aux V O1 O2 S"

L_3                 1     0     0    "V S" "O V S" "O2 O1 V S" "V Aux S" "O V Aux S"
                                     "O2 O1 V Aux S" "Adv V S" "Adv O V S" "Adv O2 O1 V S"
                                     "Adv V Aux S" "Adv O V Aux S" "Adv O2 O1 V Aux S"

L_4                 1     0     1    "S V" "O V S" "S V O" "S V O2 O1" "O1 V O2 S"
                                     "O2 V O1 S" "S Aux V" "S Aux O V" "O Aux V S"
                                     "S Aux O2 O1 V" "O1 Aux O2 V S" "O2 Aux O1 V S" "Adv V S"
                                     "Adv V O S" "Adv V O2 O1 S" "Adv Aux V S"
                                     "Adv Aux O V S" "Adv Aux O2 O1 V S"

L_5                 0     1     0    "S V" "S V O" "S V O1 O2" "S Aux V" "S Aux V O"
(English, French)                    "S Aux V O1 O2" "Adv S V" "Adv S V O" "Adv S V O1 O2"
                                     "Adv S Aux V" "Adv S Aux V O" "Adv S Aux V O1 O2"

L_6                 0     1     1    "S V" "S V O" "O V S" "S V O1 O2" "O1 V S O2"
                                     "O2 V S O1" "S Aux V" "S Aux V O" "O Aux S V"
                                     "S Aux V O1 O2" "O1 Aux S V O2" "O2 Aux S V O1" "Adv V S"
                                     "Adv V S O" "Adv V S O1 O2" "Adv Aux S V" "Adv Aux S V O"
                                     "Adv Aux S V O1 O2"

L_7                 0     0     0    "S V" "S O V" "S O2 O1 V" "S V Aux"
(Bengali, Hindi)                     "S O2 O1 V Aux" "Adv S V" "Adv S O V" "Adv S O2 O1 V"
                                     "Adv S V Aux" "Adv S O V Aux" "Adv S O2 O1 V Aux"

L_8                 0     0     1    "S V" "S V O" "O V S" "S V O2 O1" "O1 V S O2"
(German, Dutch)                      "O2 V S O1" "S Aux V" "S Aux O V" "O Aux S V"
                                     "O1 Aux S O2 V" "O2 Aux S O1 V" "Adv V S" "Adv V S O"
                                     "Adv V S O2 O1" "Adv Aux S V" "Adv Aux S O V"
                                     "S Aux O2 O1 V" "S O V Aux" "Adv Aux S O2 O1 V"

4-B Memoryless Algorithms and Markov Chains

Memoryless algorithms can be regarded as those which have no recollection of previous
data, or previous inferences made about the target function. At any point in time,
the only information upon which such an algorithm acts is the current data, and
the current hypothesis (state). A memoryless algorithm can then be regarded as
an effective procedure mapping this information to a new hypothesis. In general,
given a particular hypothesis state ($h$ in $\mathcal{H}$, the hypothesis space), and a new datum
(sentence, $s$ in $\Sigma^*$), such a memoryless algorithm will map onto a new hypothesis
($g \in \mathcal{H}$). Of course, $g$ could be the same as $h$ or it could be different depending
upon the specifics of the algorithm and the datum. If one includes the possibility of
randomization, then the mapping need not be deterministic. In other words, given a
state $h$, and sentence $s$, the algorithm maps onto a distribution $P_{\mathcal{H}}$ over the hypothesis
space, according to which the new state is selected. Clearly,

$$\sum_{h \in \mathcal{H}} P_{\mathcal{H}}(h) = 1$$

Let $\mathcal{P}$ be the set of all possible probability distributions over the (finite) hypothesis
space. For any $P_{\mathcal{H}} \in \mathcal{P}$, thus, $P_{\mathcal{H}}[h]$ is the probability measure on the hypothesis
(state) $h$.

A memoryless algorithm can then be regarded as a computable function ($f$) from
$(\mathcal{H}, \Sigma^*)$ to $\mathcal{P}$ as follows:

$$f : (\mathcal{H}, \Sigma^*) \to \mathcal{P}$$

Thus, for any $h \in \mathcal{H}$, and $s \in \Sigma^*$, the quantity $f(h, s)$ is a distribution over
the hypothesis space according to which the learner would pick the next hypothesis.
Consequently, a learner following such an algorithm would update its hypothesis
with each new sentence, and move from state to state in our finite parameter space
of hypotheses. Suppose, at a point in time, the learner is in a state $h_1$. What is
the probability that it will move to state $h_2$ after the next example? It will do so
only if the following two conditions are met. First, it receives a sentence (example),
$s$, for which $f(h_1, s)$ has a non-zero probability measure on the state $h_2$. Let this
probability measure be $f(h_1, s)[h_2]$. Second, given the probability over the hypothesis
space according to which it chooses the next hypothesis, the learner actually ends up
choosing $h_2$ as the next hypothesis.

Given a distribution $P$ on $\Sigma^*$, according to which sentences are drawn and
presented to the learner, the transition probability from $h_1$ to $h_2$ is now given by:

$$\mathrm{Prob}[h_1 \to h_2] = \sum_{\{s \mid f(h_1, s)[h_2] > 0\}} f(h_1, s)[h_2] \, P(s)$$

Having obtained the transition probabilities, it is clear that the memoryless algorithm 
is a Markov chain. 
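
The construction above can be sketched in a few lines. The code below builds the transition matrix by summing f(h1, s)[h2] P(s) over sentences, using as f the random walk learner with neither the greediness nor the single-value constraint (stay if the sentence is analyzable, otherwise jump uniformly to another state). The three toy "languages", the sentence distribution, and all function names are hypothetical and serve only to make the example self-contained.

```python
from typing import Callable, Dict, List, Set

Sentence = str
Distribution = Dict[int, float]  # next state -> probability


def random_walk_learner(languages: List[Set[Sentence]]) -> Callable[[int, Sentence], Distribution]:
    """Return f(h, s) for the unconstrained random walk learner."""
    n = len(languages)

    def f(h: int, s: Sentence) -> Distribution:
        if s in languages[h]:
            return {h: 1.0}  # sentence analyzable: keep the current hypothesis
        # not analyzable: move uniformly at random to any other state
        return {g: 1.0 / (n - 1) for g in range(n) if g != h}

    return f


def transition_matrix(f: Callable[[int, Sentence], Distribution],
                      sentence_probs: Dict[Sentence, float],
                      n_states: int) -> List[List[float]]:
    """T[h1][h2] = sum over sentences s of f(h1, s)[h2] * P(s)."""
    T = [[0.0] * n_states for _ in range(n_states)]
    for h1 in range(n_states):
        for s, p_s in sentence_probs.items():
            for h2, q in f(h1, s).items():
                T[h1][h2] += q * p_s
    return T


if __name__ == "__main__":
    # Hypothetical toy family; state 0 is the target, sampled uniformly.
    languages = [{"s v", "s v o"}, {"s v", "o v s"}, {"v s"}]
    P = {"s v": 0.5, "s v o": 0.5}
    for row in transition_matrix(random_walk_learner(languages), P, len(languages)):
        print(row)
```

In this toy case the target state is absorbing, since every sentence drawn from the target is analyzable there, and each row of the matrix sums to one.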
Chapter 5



The Logical Problem of Language Change 



Abstract 

In this chapter, we consider the problem of language change. Linguists have to explain not only how 
languages are learned (a problem we investigated in the previous chapter), but also how and why they 
have evolved in certain trajectories. While the language learning problem has concentrated on the 
behavior of the individual child, and how it acquires a particular grammar (from a class of grammars
$\mathcal{G}$), we consider, in this chapter, a population of such child learners, and investigate the emergent,
global, population characteristics of the linguistic community over several generations. We argue 
that language change is the logical consequence of specific assumptions about grammatical theories, 
and learning paradigms. In particular, we are able to transform the parameterized theories, and 
memoryless algorithms of the previous chapter into grammatical dynamical systems, whose evolution 
depicts the evolving linguistic composition of the population. We investigate the linguistic, and 
computational consequences of this fact. From a more programmatic perspective, we lay a possible 
logical framework for the scientific study of historical linguistics, and introduce thereby, a formal 
diachronic criterion for adequacy of linguistic theories. 



5.1 Introduction 

As is well known, languages change over time. Language scientists have long been oc- 
cupied with describing language changes in phonology, syntax, and semantics. There 
have been many descriptive and a few explanatory accounts of language change, in- 
cluding some explicit computational models. Many authors appeal naturally to the 
analogy between language change and another familiar model of change, namely, 
biological evolution. There is also a notion that language systems are adaptive (dy- 
namical) ones. For instance, Lightfoot (1991, chapter 7, pages 163-65ff.) talks about 
language change in this way: 

Some general properties of language change are shared by other dynamic
systems in the natural world. . .

Indeed, entire books have been devoted to the description of language change
using the terminology of population biology: genetic drift, clines, etc. (UCLA book 
on language diversity in space and time). 

However, these analogies have rarely been pursued beyond casual and descriptive 
accounts. 29 In this paper we would like to formalize these linguists' intuitive notions in 
a specific way as a concrete computational model, and investigate the consequences of 
this formalization. In particular, we show that a model of language change emerges as 
a logical consequence of language learnability, a point made by Lightfoot (1991). We 
shall see that Lightfoot's intuition that languages could behave just as though they 
were dynamical systems is essentially correct, and we can provide concrete examples 
of both "gradual" and "sudden" syntactic changes occurring over time periods ranging from many
generations to just a single generation. 30

Not surprisingly, many other interesting points emerge from the formalization, 
some programmatic in nature: 

• We provide a general procedure for deriving a dynamical systems model from 
grammatical theories and learning paradigms. 

• Learnability is a well-known criterion for testing the adequacy of grammatical 
theories. With our new model, we can now give an evolutionary criterion. By 
this we mean that by comparing the evolutionary trajectories of derived dynam- 
ical linguistic systems to historically observed trajectories, one can determine 
the adequacy of linguistic theories or learning algorithms. 

• We explicitly derive dynamical systems corresponding to parameterized linguis- 
tic theories (e.g. Head First/Final parameter in HPSG or GB grammars) and 
memoryless language learning algorithms (e.g. gradient ascent in parameter 
space). 

• Concretely, we illustrate the use of dynamical systems as a research tool by 
considering the loss of Verb Second position in Old French as compared to 
Modern French. We demonstrate that, when mathematically modeled by our 
system, one grammatical parameterization in the literature does not seem to 
permit this historical change, while another does. We are also able to more 
accurately model the time course of language change. In particular, in contrast 
to Kroch (1989) and others, who mimic population biology models by imposing
an S-shaped logistic change by assumption, we show that the time course of
language change need not be S-shaped. Rather, language-change envelopes are
derivable from more fundamental properties of dynamical systems; sometimes
they are S-shaped, but they can also have a nonmonotonic shape, or even
nonsmooth, "catastrophic" properties.

29 Some notable exceptions are Kroch (1990), Clark and Roberts (1993).

30 Lightfoot (1991) refers to these sudden changes, acting over 1 generation, as "catastrophic" but
in fact this term usually has a different sense in the dynamical systems literature.

• We formally examine the "diachronic envelopes" possible under varying con- 
ditions of alternative language distributions, language acquisition algorithms, 
parameterizations, input noise, and sentence distributions — that is, what lan- 
guage changes are possible by varying these dimensions. This involves the 
simulation of these dynamical systems under different initial conditions, and 
characterizations of the resulting evolutionary trajectories, phase-space plots, 
issues of stability, and the like. 

• The formal diachronic model as a dynamical system provides a novel possi- 
ble source for explaining several linguistic changes including (a) the evolution 
of modern Greek phonology from proto-Indo-European (b) Bickerton's (199x) 
Creole hypothesis (concerning the striking fact that all Creoles, irrespective of 
linguistic origin, have exactly the same grammar) as the condensation point of 
a dynamical system (though we have not tested these possibilities explicitly). 

The Acquisition-Based Model of Language Change: 
The Logical Problem of Language Change 

How does the combination of a grammatical theory and learning algorithm lead to
a model of language change? We first note that, just as with language acquisition,
there is a seeming paradox in language change: it is generally assumed that children
acquire their caretaker (target) grammars without error. However, if this were always
true, at first glance grammatical changes within a population could seemingly
never occur, since generation after generation, the children would have successfully
acquired the grammar of their parents.

Of course, Lightfoot and others have pointed out the obvious solution to this 
paradox: the possibility of slight misconvergence to target grammars could, over time 
(generations), drive language change, much as speciation occurs in the population 
biology sense. We pursue this point in detail below. Similarly, just as in the biological 
case, some of the most commonly observed changes in languages seem to occur as the 
result of the effects of surrounding populations, whose features infiltrate the original 
language.

We begin our treatment of this subject by arguing that the problem of language
acquisition at the individual level leads logically to the problem of language change 
at the group (or population) level. Consider a population speaking a particular 
language 31 . This is the target language — children are exposed to primary linguis- 
tic data from this source (language); typically in the form of sentences uttered by 
caretakers (adults). The logical problem of language acquisition is how children ac- 
quire this target language from the primary linguistic data — in other words to come 
up with an adequate learning theory. Such a learning algorithm is simply a mapping 
from primary linguistic data to the class of grammars. For example, in a typical 
inductive inference model (as we saw in the previous chapter), given a stream of sen- 
tences (primary linguistic data), the algorithm would simply update its grammatical 
hypothesis with each new sentence according to some preprogrammed procedure. An 
important criterion for learnability (as we saw in the previous chapter) is to require 
that the algorithm converge to the target as the data goes to infinity. 

Now, suppose that the primary linguistic data presented to the child is altered 
(due, perhaps, to presence of foreign speakers, contact with another population, dis- 
fluencies etc.). In other words, the sentences presented to the learner (child) are no 
longer consistent with a single target grammar. In the face of this input, the learning 
algorithm might no longer converge to the target grammar. Indeed, it might converge
to some other grammar ($g_2$); or it might converge to $g_2$ with some probability,
$g_3$ with some other probability, and so on. In either case, children attempting to
solve the acquisition problem by means of the learning algorithm would have internalized
grammars different from the parental (target) grammar. Consequently, in
one generation, the linguistic composition of the population would have changed 32.
Furthermore, this change is driven by 1) the primary linguistic data (composed in
this case of sentences from the original target language, and sentences from the foreign
speakers) and 2) the language acquisition device, which, acting upon the primary
evidence, causes the acquisition of a different grammar by the children. Finally, the
change is limited by the hypothesis space of possible grammars; after all, the children 
can never converge to a grammar which lies outside this space of grammars. 

In short, on this view, language change is a logical consequence of specific assump- 
tions about 

1. the hypothesis space of grammars: in a parametric theory, like the ones we
examine in this thesis, this corresponds to a particular choice of parameterization

2. the language acquisition device: in other words, the learning algorithm the child
uses to develop hypotheses on the basis of data

3. the primary linguistic data: the sentences which are presented to the children
of any one generation

31 In our framework of analysis, this implies that all the adult members of this population have
internalized the same grammar (corresponding to the language they speak).

32 Sociological factors affecting language change affect language acquisition in exactly the same
way, yet are abstracted away from the formalization of the logical problem of language acquisition.
In this same sense, we similarly abstract away such causes.

If we specify 1) through 3) for a particular generation, we should, in principle, be able 
to compute the linguistic composition for the next generation. In this manner, we 
can compute the evolving linguistic composition of the population from generation 
to generation; we arrive at a dynamical system. We can be a bit more precise about 
this. First, let us recall our framework for language learning. Then we will show how 
to derive a dynamical system from this framework. 

The Language Learning Framework: 

Denote by $\mathcal{G}$ a family of possible (target) grammars. Each grammar $g \in \mathcal{G}$ defines
a language $L(g) \subseteq \Sigma^*$ over some alphabet $\Sigma$ in the usual fashion. Let there be a
distribution $P$ on $\Sigma^*$ according to which sentences are drawn and presented to the
learner. Note that if there is a well-defined target, $g_t$, and only positive examples from
this target are presented to the learner, then $P$ will have all its measure on $L(g_t)$,
and zero measure on sentences outside of this. Suppose $n$ examples are drawn in
this fashion; one can then let $\mathcal{D}_n = (\Sigma^*)^n$ be the set of all $n$-example data sets the
learner might potentially be presented with. A learning algorithm $\mathcal{A}$ can then be
regarded as a mapping from $\mathcal{D}_n$ to $\mathcal{G}$. Thus, acting upon a particular presentation
sequence $d_n \in \mathcal{D}_n$, the learner posits a hypothesis $\mathcal{A}(d_n) = h_n \in \mathcal{G}$. Allowing for
the possibility of randomization, the learner could, in general, posit $h_i \in \mathcal{G}$ with
probability $p_i$ for such a presentation sequence $d_n$. The standard (stochastic version)
learnability criterion (after Gold, 1967) can then be stated as follows:

For every target grammar, $g_t \in \mathcal{G}$, with positive-only examples presented according
to $P$ as above, the learner must converge to the target with probability 1, i.e.,

$$\mathrm{Prob}[\mathcal{A}(d_n) = g_t] \longrightarrow 1 \quad \text{as } n \to \infty$$

In the previous chapter, we concerned ourselves with this learnability issue for
memoryless algorithms in finite parameter spaces.

From Language Learning to Population Dynamics:

The framework for language learning has learners (children) attempting to infer
grammars on the basis of linguistic data (sentences). At any point in time, $n$ (i.e., after
hearing $n$ examples), the child learner has a current hypothesis, $h$, with probability
$p_n(h)$. What happens when there is a population of child learners? Since an arbitrary
child learner has a probability $p_n(h)$ of developing hypothesis $h$ (for every $h \in \mathcal{G}$),
it follows that a fraction $p_n(h)$ of the population of children would have internalized
the grammar $h$ after $n$ examples. We therefore have a current state of the population
after n examples. This state of the population (of children) might well be different 
from the state of the parental population. Pretend for a moment that after n exam- 
ples, maturation occurs, i.e., the child retains for the rest of its life, the grammatical 
hypothesis after n examples, then we would have arrived at the state of the mature 
population for the next generation 33 . This new generation now produces sentences 
for the following generation of children according to the distribution of grammars in 
the population. The same process repeats itself and the linguistic composition of the 
population evolves from generation to generation. 

Formalizing the Argument Further: 

This formulation leads naturally to a discrete-time dynamical systems model for lan- 
guage change. In order to define such a dynamical system formally, one needs to 
specify 

1. the state space, $\mathcal{S}$: a set of states the system can be in. At any given point in
time, $t$, the system is in exactly one state $s \in \mathcal{S}$;

2. an update rule defining the manner in which the state of the system changes
from one time to the next. Typically, this involves the specification of a function,
$f$, which maps $s_t$ (the state at time $t$) to $s_{t+1}$ (the state at time $t + 1$). 34

For example, a typical linear dynamical system might consist of state variables
$x$ (where $x$ is a $k$-dimensional state vector) and a system of differential equations
$x' = Ax$ ($A$ is a matrix operator) which characterize the evolution of the states
with time. RC circuits are a simple example of linear dynamical systems. The state
(current) evolves as the capacitor discharges through the resistor. Population growth
models (for example, using logistic equations) provide other examples.

33 Maturation is a reasonable hypothesis. After all, it seems even more unreasonable to imagine
that children are forever wandering around in hypothesis space. After a certain point, and there
is evidence from developmental psychology to suggest that this is the case, the child matures and
retains its current grammatical hypothesis for the rest of its life.

34 In general, this mapping could be fairly complicated. For example, it could depend on previous
states, future states etc. For reference, see Strogatz (1993).

Figure 5-41: A simple illustration of the state space for the 3-parameter syntactic
case. There are 8 grammars; a probability distribution on these 8 grammars, as
shown above, can be interpreted as the linguistic composition of the population.
Thus, a fraction $P_i$ of the population has internalized grammar $g_i$, and so on.

The State Space:

In our case, the state space is the space of possible linguistic compositions of the
population. More specifically, it is a distribution $P_{pop}$ on the space of grammars, $\mathcal{G}$.
For example, consider the three parameter syntactic space described in Gibson and
Wexler (1994) and analyzed in the previous chapter. This defines 8 possible "natural"
grammars. Thus $\mathcal{G}$ has 8 elements. We can picture a distribution on this space as
shown in fig. 5-41. In this particular case, the state space is

$$\mathcal{S} = \left\{ P \in \mathbb{R}^8 \;\middle|\; \sum_{i=1}^{8} P_i = 1 \right\}$$

We interpret the state as the linguistic composition of the population. For
example, a distribution which puts all its weight on grammar $g_1$ and 0 everywhere
else indicates a homogeneous population which speaks the language corresponding
to grammar $g_1$. Similarly, a distribution which puts a probability mass of 1/2 on
$g_1$ and 1/2 on $g_2$ indicates a population (non-homogeneous) with half its speakers
speaking a language corresponding to $g_1$ and half speaking a language corresponding
to $g_2$.
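
As a small illustration of this state space, the sketch below represents a population state as a list of 8 fractions and checks membership in $\mathcal{S}$. The helper names (`is_population_state`, `homogeneous`, `mixed`) are hypothetical, and the non-negativity check is an added assumption that follows from reading the state as a probability distribution; the displayed definition only states the sum constraint.

```python
from math import isclose

NUM_GRAMMARS = 8  # the 3-parameter space has 2**3 = 8 grammars


def is_population_state(p, tol=1e-9):
    """Return True if p is a valid linguistic composition (a point of S).

    Non-negativity is an assumption added here; the text only states sum = 1.
    """
    return (
        len(p) == NUM_GRAMMARS
        and all(x >= 0.0 for x in p)
        and isclose(sum(p), 1.0, abs_tol=tol)
    )


def homogeneous(i):
    """State in which everyone speaks the language of grammar g_i (1-based)."""
    return [1.0 if k == i - 1 else 0.0 for k in range(NUM_GRAMMARS)]


def mixed(weights):
    """State built from a {grammar index: fraction} mapping (1-based)."""
    return [weights.get(k + 1, 0.0) for k in range(NUM_GRAMMARS)]


all_g1 = homogeneous(1)               # everybody speaks L(g_1)
half_half = mixed({1: 0.5, 2: 0.5})   # half speak L(g_1), half speak L(g_2)
```

Both example states from the text pass the check, whereas a vector such as `[0.5] * 8` does not, since its entries sum to 4.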
The Update Rule: 

The update rule is obtained by considering the learning algorithm, $\mathcal{A}$, involved.
For example, given the state at time $t$, $P_{pop,t}$, i.e., the distribution of speakers in
the parental population (they are the generators of the primary linguistic data for
the next generation), one can obtain the distribution with which sentences from $\Sigma^*$
will be presented to the learner. To do this, imagine that the $i$th linguistic group in
the population (speaking language $L_i$) produces sentences with distribution $P_i$ (on
the sentences of $L_i$, i.e., sentences not in $L_i$ are produced with probability 0). Then
for any $\omega \in \Sigma^*$, the probability with which it is presented to the learner is given by

$$P(\omega) = \sum_i P_i(\omega)\, P_{pop,t}(i)$$

35 Obviously one needs to be able to define a $\sigma$-algebra on the space of grammars, and so on. For
the cases we look at, this is not a problem because the set of grammars is finite.



Now that the distribution with which sentences are presented to the learner is
determined, the algorithm operates on the linguistic data, $d_n$ (this is a dataset of
$n$ example sentences drawn according to distribution $P$), and develops hypotheses
($\mathcal{A}(d_n) \in \mathcal{G}$). Furthermore, one can, in principle, compute the probability with which
the learner will develop hypothesis $h_i$ after $n$ examples:

Finite Sample: $\mathrm{Prob}[\mathcal{A}(d_n) = h_i] = p_n(h_i)$ (5.33)

This finite sample situation is always well defined. In other words, the probability $p_n$
exists 36 .
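
To make the finite-sample quantity concrete, here is a minimal sketch, assuming a toy setting that is not taken from the thesis: two hypothetical grammars `g1` and `g2` over three sentence types, and a deterministic, memoryless learner. It first forms the mixture $P$ from the population state and the per-language distributions, and then computes $p_n(h)$ by brute-force enumeration of every $n$-example data set, which is feasible only for very small $n$.

```python
from itertools import product


def mixture(pop_state, sentence_dists):
    """P(w) = sum_i P_i(w) * P_pop(i), for every sentence w."""
    P = {}
    for frac, dist in zip(pop_state, sentence_dists):
        for w, prob in dist.items():
            P[w] = P.get(w, 0.0) + frac * prob
    return P


def finite_sample_probs(learner, P, n, grammars):
    """p_n(h) = P[{d_n : learner(d_n) = h}], by enumerating all data sets d_n."""
    p_n = {h: 0.0 for h in grammars}
    for d_n in product(list(P), repeat=n):
        prob = 1.0
        for w in d_n:
            prob *= P[w]  # examples are drawn i.i.d. from P
        p_n[learner(d_n)] += prob
    return p_n


# Hypothetical toy setup: two languages over three sentence types.
L = {"g1": {"a", "b"}, "g2": {"b", "c"}}
P1 = {"a": 0.5, "b": 0.5}  # uniform on L(g1)
P2 = {"b": 0.5, "c": 0.5}  # uniform on L(g2)


def toy_learner(d_n):
    """Deterministic and memoryless (an assumption for this sketch): start at
    g1; on a sentence the current grammar cannot analyze, switch to the other
    grammar if that one can analyze it."""
    h = "g1"
    for w in d_n:
        if w not in L[h]:
            other = "g2" if h == "g1" else "g1"
            if w in L[other]:
                h = other
    return h


P = mixture([0.5, 0.5], [P1, P2])  # half the parents speak each language
p_n = finite_sample_probs(toy_learner, P, n=4, grammars=["g1", "g2"])
```

Here `p_n` plays the role of the next generation's composition under a finite-sample analysis with $n = 4$. Enumeration grows exponentially in $n$, which is one reason a Markov chain treatment of memoryless learners is attractive.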

Learnability requires $p_n(g_t)$ to go to 1 for the unique target grammar, $g_t$, if such
a grammar exists. In general, however, there is no unique target grammar since
we have non-homogeneous linguistic populations. However, the following limiting
behavior might still exist:

Limiting Sample: $\lim_{n \to \infty} \mathrm{Prob}[\mathcal{A}(d_n) = h_i] = p_i$ (5.34)

Thus, the child, according to the arguments described earlier, internalizes grammar
$h_i \in \mathcal{G}$ with probability $p_n(h_i)$ (for a finite sample analysis) and with probability
$p_i$ "in the limit". We can find $p_i$ for every $i$, and the next generation would then
have a proportion $p_i$ (or $p_n(h_i)$, if one wanted to do a finite sample analysis) of people
who have internalized the grammar $h_i$. Consequently, the linguistic composition of
the next generation is given by $P_{pop,t+1}(h_i) = p_i$ (or $p_n(h_i)$). In this fashion,

$$P_{pop,t} \xrightarrow{\;\mathcal{A}\;} P_{pop,t+1}$$



36 This is easy to see for deterministic algorithms, $\mathcal{A}_{det}$. Such an algorithm would have a precise
behavior for every data set of $n$ examples drawn. In our case, the examples are drawn in i.i.d.
fashion according to a distribution $P$ on $\Sigma^*$. It is clear that $p_n(h_i) = P[\{d_n \mid \mathcal{A}_{det}(d_n) = h_i\}]$. For
randomized algorithms, the case is trickier, but the probability still exists. We saw in the previous
chapter how to compute $p_n$ for randomized memoryless algorithms.

Remarks

1. The finite sample case probability always exists. Suppose we have solved the
maturation problem, i.e., we know the rough amount of time the learner takes to
develop its mature (adult) hypothesis. This is tantamount to knowing (roughly, if
not exactly) the number of examples, $N$, the child would have heard by then. In that
case $p_N(h)$ is the probability that the child internalizes the grammar $h$. This ($p_N(h)$)
is the percentage of speakers of $L_h$ in the next generation. Note that under this finite
sample analysis, for a homogeneous population, with all adults speaking a particular
language (corresponding to grammar $g$, say), $p_N(g)$ will not be 1; that is, there will
be a small percentage who have misconverged. This percentage might blow up over
generations; and we potentially have unstable languages. This is in contrast to the
limiting analysis of homogeneous populations which is trivial for learnable families of
grammars.

2. The limiting case analysis is more problematic, though more consistent with
learnability theories "in the limit." First, the limit in question need not always exist.
In such a case, of course, no limiting analysis is possible. If, however, the limit does
exist, then $p_i$ is the probability that a child learner attains the grammar $h_i$ in the
limit, and this is the proportion of the population with this internal grammar in the
next generation.

3. In general, the linguistic composition for the $(t+1)$th generation is given in
similar fashion from the linguistic composition for the $t$th generation. Such a dynamical
system exists for every assumption of a) $\mathcal{A}$, b) $\mathcal{G}$, and c) the $P_i$'s, the probabilities
with which sentences are produced by speakers of the $i$th grammar 37 . Thus we see
that

$$(\mathcal{G}, \mathcal{A}, \{P_i\}) \longrightarrow \mathcal{D} \ \text{(dynamical system)}$$

4. The formulation is completely general so far. It does not assume any par- 
ticular linguistic theory, or learning algorithm, or distribution with which sentences 
are drawn. Of course, we have implicitly assumed a learning model, i.e., positive 
examples are drawn in i.i.d. fashion and presented to the learner (algorithm). Our 
formalization of the grammatical dynamical systems follows as a logical consequence 
of this learning framework. One can conceivably imagine other learning frameworks — 
these would potentially give rise to other kinds of dynamical systems; but we don't 
formalize them here. 

At this stage, we have developed our case in abstraction. The next obvious step
is to choose specific linguistic theories, and learning paradigms, and compute our
dynamical system. The important questions are: can we really compute all the
relevant quantities to specify the dynamical system? Can we evaluate the behavior
(the phase-space characteristics) of the resulting dynamical system? Does this allow
us to shed light on linguistic theories? We show some concrete examples of this
in this chapter. Our examples are conducted within the principles and parameters
theory of modern linguistics.

37 Note that this probability could evolve with generations as well. That will complete all the
logical possibilities. However, for simplicity, we assume that this does not happen.

5.2 Language Change in Parametric Systems 

The previous section led us through the important steps in formalizing the process 
of language change, leading ultimately to a computational paradigm within which 
such change can be meaningfully studied. We carry out our investigations within 
the principles and parameters framework introduced in the previous chapter. In 
Chapter 4, we investigated the problem of learnability within this framework. In 
particular, we saw that the behavior of any memoryless algorithm can be modeled as 
a Markov chain. This analysis will allow us to solve equations 5.33 and 5.34, and obtain
the update equations of our dynamical system. We now proceed to do this.

1) the grammatical theory: Assume there are $n$ parameters; this leads to a space
$\mathcal{G}$ with $2^n$ different grammars in it.

2) the distribution with which data is produced: If there are speakers of the
$i$th language, $L_i$, in the population, let them produce sentences according to the
distribution, $P_i$, on the sentences of this language. For the most part, we will assume,
in our simulations, that this is uniform on degree-0 sentences (exactly as we did in
our analysis of the learnability problem).

3) the learning algorithm: Let us imagine that the child learner follows some 
memoryless (incremental) algorithm to set parameters. For the most part, we will 
assume that the algorithm is the TLA or one of the variants discussed in the previous 
chapter. 

From One Generation to the Next: The Update Rule 

Suppose the state of the parental population is $P_{pop,n}$ on $\mathcal{G}$. Then one can obtain
the distribution $P$ on the sentences of $\Sigma^*$ according to which sentences will be
presented to the learner. Once such a distribution is obtained, we can compute the
transition matrix $T$ according to which the learner updates its hypotheses with each
new sentence (as shown in the previous chapter). From $T$, one can finally compute
the following quantities:

$$\mathrm{Prob}[\text{Learner's hypothesis} = h_i \in \mathcal{G} \text{ after } m \text{ examples}] = \left\{ \frac{1}{2^n} (1, \ldots, 1)' \, T^m \right\}[i]$$

Similarly, making use of limiting distributions of Markov chains (see Resnick,
1992) one can obtain the following (where ONE is a $2^n \times 2^n$ matrix with all ones):

$$\mathrm{Prob}[\text{Learner's hypothesis} = h_i \text{ "in the limit"}] = \left\{ (1, \ldots, 1)' \, (I - T + \mathrm{ONE})^{-1} \right\}[i]$$

These expressions allow us to compute the linguistic composition of the population
according to our analysis of the previous section.
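
The two expressions can be sketched in a few lines of plain Python. This is an illustrative sketch, not the code used for the simulations: the function names are hypothetical, the 2-state matrix `T` is made up, and the limiting computation assumes the chain has a unique stationary distribution (so that $I - T + \mathrm{ONE}$ is invertible).

```python
def row_times_matrix(v, T):
    """Row vector times matrix: (v T)[j] = sum_i v[i] * T[i][j]."""
    K = len(T)
    return [sum(v[i] * T[i][j] for i in range(K)) for j in range(K)]


def finite_sample(T, m):
    """{(1/K)(1,...,1)' T^m}: hypothesis distribution after m examples,
    starting from a uniform initial hypothesis over the K grammars."""
    K = len(T)
    v = [1.0 / K] * K
    for _ in range(m):
        v = row_times_matrix(v, T)
    return v


def limiting(T):
    """Solve pi (I - T + ONE) = (1,...,1) for pi by Gauss-Jordan elimination.

    Assumes I - T + ONE is invertible (unique stationary distribution)."""
    K = len(T)
    M = [[(1.0 if i == j else 0.0) - T[i][j] + 1.0 for j in range(K)]
         for i in range(K)]
    # pi M = 1' is equivalent to M^T pi^T = 1; build the augmented system.
    A = [[M[j][i] for j in range(K)] + [1.0] for i in range(K)]
    for col in range(K):
        pivot = max(range(col, K), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(K):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][K] / A[i][i] for i in range(K)]


# Made-up transition matrix for a learner with two grammatical states.
T = [[0.9, 0.1],
     [0.3, 0.7]]
after_128 = finite_sample(T, 128)  # finite-sample analysis with m = 128
in_limit = limiting(T)             # limiting analysis
```

For this matrix the limiting distribution satisfies $\pi T = \pi$, and the finite-sample distribution after 128 examples is numerically indistinguishable from it; for slowly mixing chains the two analyses can differ noticeably.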
Remarks: 

1. The limiting distribution needs to be interpreted. Markov chains corresponding 
to population mixes do not have an absorbing state. Instead they have recurrent 
states. These states will be visited infinitely often. There might be more than one 
state that will be visited infinitely often. However, the percentage of time the learner
spends in a particular state might vary. This is provided by the equation above. Since
we know the fraction of the time the learner spends in each grammatical state in the
limit, we assume that this is the probability with which it internalizes the grammar
corresponding to that state in the Markov chain.

2. The finite case analysis always works. The limiting analysis need not work.
Moreover, the limiting analysis is informative only when there is more than one target.
That is, if there is only one target grammar and the algorithm can learn it, all children
would converge to that target in the limit, and the population characteristics would not
change with generations.

We provide now the basic computational framework for modeling language change. 

1. Let $\pi_1$ be the initial population mix, i.e., the percentage of different language
speakers in the community. Assuming, then, that the $i$th group of speakers
produces sentences with probability $P_i$, we can obtain $P$ with which sentences
in $\Sigma^*$ occur for the next generation of children.

2. From $P$, we can obtain the transition probabilities for the child learners and the
limiting distribution $\pi_2$ for the next generation.

3. The second generation produces sentences with $\pi_2$. We can repeat step 1 and
obtain $\pi_3$; in general a population mix $\pi_i$ will over a generation change to a mix
of $\pi_{i+1}$.
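
The three steps above can be sketched as a simple loop. Everything specific in this sketch is an assumption made for illustration: the two toy languages, their sentence distributions, the one-parameter TLA-like learner, and the use of a finite-sample analysis with 128 examples in place of the limiting distribution. It is not the 3-parameter system studied below.

```python
def sentence_distribution(pop, dists):
    """Step 1: P(w) = sum_i pop[i] * P_i(w)."""
    P = {}
    for frac, dist in zip(pop, dists):
        for w, prob in dist.items():
            P[w] = P.get(w, 0.0) + frac * prob
    return P


def transition_matrix(P, languages):
    """Step 2a: transition matrix of a one-parameter, TLA-like learner
    (an assumption): on a sentence outside its current language, the learner
    flips to the other grammar if that grammar analyzes the sentence.
    Written for exactly two grammars."""
    T = [[0.0, 0.0], [0.0, 0.0]]
    for i in range(2):
        j = 1 - i
        move = sum(p for w, p in P.items()
                   if w not in languages[i] and w in languages[j])
        T[i][j] = move
        T[i][i] = 1.0 - move
    return T


def next_generation(pop, dists, languages, n_examples):
    """Steps 1-2: population mix of the children after n_examples sentences."""
    P = sentence_distribution(pop, dists)
    T = transition_matrix(P, languages)
    K = len(pop)
    v = [1.0 / K] * K  # uniform initial hypothesis
    for _ in range(n_examples):
        v = [sum(v[i] * T[i][j] for i in range(K)) for j in range(K)]
    return v


# Hypothetical toy languages and (non-uniform) sentence distributions.
languages = [{"a", "b"}, {"b", "c"}]
dists = [{"a": 0.5, "b": 0.5}, {"b": 0.2, "c": 0.8}]

pop = [0.9, 0.1]  # initial mix pi_1
history = [pop]
for _ in range(30):  # step 3: repeat from generation to generation
    pop = next_generation(pop, dists, languages, n_examples=128)
    history.append(pop)
```

In this toy setup, speakers of the second language produce unambiguous sentences more often than speakers of the first, so the share of second-language speakers in `history` grows from one generation to the next. The point of the sketch is only the shape of the computation: mix, sentence distribution, transition matrix, new mix.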
5.3 Example 1: A Three Parameter System

The previous section developed the necessary mathematical and computational tools 
to completely specify the dynamical systems corresponding to memoryless algorithms 
operating on finite parameter spaces. In this example, we investigate the behavior 
of these dynamical systems. Recall that every choice of $(\mathcal{G}, \mathcal{A}, \{P_i\})$ gives rise to a
unique dynamical system. We start by assuming:

1) $\mathcal{G}$: This is the 3-parameter syntactic subsystem described in the previous chapter
(Gibson and Wexler, 1994). Thus $\mathcal{G}$ has exactly 8 grammars.

2) $\mathcal{A}$: The memoryless algorithms we consider are the TLA, and variants obtained by
dropping either or both of the single-value and greediness constraints.

3) $\{P_i\}$: For the most part, we assume sentences are produced according to a
uniform distribution on the degree-0 sentences of the relevant language, i.e., $P_i$ is
uniform on (degree-0 sentences of) $L_i$.

5.3.1 Starting with Homogeneous Populations: 

Here we investigate how stable the languages in the parametric system are in the ab- 
sence of noise or other confounding factors like foreign speech. Thus we start off with 
a linguistically homogeneous population producing sentences according to a uniform 
distribution on the degree-0 sentences of the target language (parental language). We 
compute the distribution of the children in the parameter space after 128 example
sentences (recall, by the analysis of the previous chapter, the learners converge to the
target with high probability after hearing this many sentences). Some small
proportion of the children will have misconverged; the goal is to see whether this small
proportion can drive language change and, if so, in what direction.

$\mathcal{A}$ = TLA; $P_i$ = Uniform; Finite Sample = 128

The table below (Table 5.3) shows the result after 30 generations. Languages are numbered
from 1 to 8 according to the scheme in the appendix of chapter 4.

Observations: Some striking patterns are observed.

1. First, all the +V2 languages are relatively stable, i.e., the linguistic composition 
did not vary significantly over 30 generations. This means that every succeeding gen- 
eration acquired the target parameter settings and no parameter drifts were observed 
over time. 

2. Populations speaking -V2 languages all drift to speaking +V2 languages. Thus
a population speaking $L_1$ starts speaking mostly $L_2$. A population speaking language
$L_7$ gradually shifts to a population with 54 percent speaking $L_2$ and 35 percent
speaking $L_4$ (with a smattering of other speakers) and seems (?) to remain basically
stable in this mix thereafter. Note that this relative stability of +V2, and the tendency
of -V2 languages to drift to +V2 ones, are contrary to assertions in the linguistic
literature. Lightfoot (1991), for example, claims the tendency to lose V2 dominates
the reverse tendency in the world's languages. Certainly, both English and French
lost the V2 parameter setting, an empirically observed phenomenon that needs to
be explained. Right away, we see that our dynamical system does not evolve in
the expected pattern. The problem could be due to incorrect assumptions about
the parameter space, the algorithm, initial conditions, or distributional assumptions
about the sentences. This needs to be examined, no doubt, but we have just seen a
concrete example of how assumptions about grammatical theory, and learning theory,
have made evolutionary predictions; in this case the predictions are incorrect, and
our model is falsified.

Initial Language    Change to Language?
(-V2) 1             2 (0.85), 6 (0.1)
(+V2) 2             2 (0.98); stable
(-V2) 3             6 (0.48), 8 (0.38)
(+V2) 4             4 (0.86); stable
(-V2) 5             2 (0.97)
(+V2) 6             6 (0.92); stable
(-V2) 7             2 (0.54), 4 (0.35)
(+V2) 8             8 (0.97); stable

Table 5.3: Language change driven by misconvergence. A finite-sample analysis was
conducted allowing each child learner 128 examples to internalize its grammar. Initial
populations were linguistically homogeneous, and they drifted (or not) to different
linguistic compositions. The major language groups after 30 generations have been
listed in this table.

3. The rates at which the linguistic composition changes vary significantly.
Consider for example the change of $L_1$ to $L_2$. Fig. 5-42 below shows the gradual decrease
in speakers of $L_1$ over successive generations along with the increase in $L_2$ speakers.
We see that over the first six or seven generations very little change occurs; thereafter,
over the next six or seven generations, the population switches at a much faster rate.
Note that in this particular case, the two languages differ only in the V2 parameter; so
the curves essentially plot the gain of V2. In contrast, consider fig. 5-43 which shows
the decrease of $L_5$ speakers and the shift to $L_2$. Here we notice a sudden change; over
a space of 4 generations, the population has shifted completely. The time course of
language change has been given some attention in linguistic analyses of diachronic
syntax change, and we return to this in a later section.

Figure 5-42: Percentage of the population speaking languages $L_1$ and $L_2$ as it evolves
over the number of generations. The plot has been shown only up to 20 generations,
as the proportions of $L_1$ and $L_2$ speakers do not vary significantly thereafter. Notice
the "S" shaped nature of the curve (Kroch, 1989, imposes such a shape using models
from population biology, while we obtain this as an emergent property of our dynamical
model from different starting assumptions). Also notice the region of maximum
change as the V2 parameter is slowly set by an increasing proportion of the population.
$L_1$ and $L_2$ differ only in the V2 parameter setting.



4. We see that in many cases, the homogeneous population splits up into different
linguistic groups, and seems to remain stable in that mix. In other words, certain
combinations of language speakers seem to asymptote towards equilibrium (at least
by examining the 30 generations simulated so far). For example, a population of
$L_7$ speakers shifts (over 5-6 generations) to one with 54 percent speaking $L_2$ and
35 percent speaking $L_4$ and remains that way with no shifts in the distribution of
speakers. Is this really a stable mix? Or will the population shift suddenly after
another 100 generations? Can we characterize the stable points ("limit cycles")?
Other linguistic mixes are inherently unstable mixes. They might drift systematically
to stable situations, or might shift dramatically.

In table 5.3, why are some languages stable while others are unstable? It seems 
that the instability and the drifts observed are to a large extent an artifact of the 
learning algorithm used. Remember that TLA suffers from the problem of local 
maxima. We notice that those languages whose acquisition is not impeded by local

Initial Language    Change to Language?
-V2 1               2 (0.41), 4 (0.19), 6 (0.18), 8 (0.13)
+V2 2               2 (0.42), 4 (0.19), 6 (0.17), 8 (0.12)
-V2 3               2 (0.40), 4 (0.19), 6 (0.18), 8 (0.13)
+V2 4               2 (0.41), 4 (0.19), 6 (0.18), 8 (0.13)
-V2 5               2 (0.40), 4 (0.19), 6 (0.18), 8 (0.13)
+V2 6               2 (0.40), 4 (0.19), 6 (0.18), 8 (0.13)
-V2 7               2 (0.40), 4 (0.19), 6 (0.18), 8 (0.13)
+V2 8               2 (0.40), 4 (0.19), 6 (0.18), 8 (0.13)

Table 5.4: Language change driven by misconvergence. A finite-sample analysis was
conducted allowing each child learner (following the TLA with single-value dropped)
128 examples to internalize its grammar. Initial populations were linguistically
homogeneous, and they drifted to different linguistic compositions. The major language
groups after 30 generations have been listed in this table. Notice how all initially
homogeneous populations tend to the same composition.

sentence it cannot analyze, it chooses any of the alternative grammars and attempts 
to analyze the sentence with it. Greediness is retained; thus the learner retains its 
original hypothesis if the new one is also not able to analyze the sentence. Table 5.4 
shows the distribution of speakers after 30 generations. 

Observations: In this situation there are no local maxima, and the pattern of evolution 
takes on a very different nature. There are two distinct observations to be made. 

1. All homogeneous populations (irrespective of what language they speak)
eventually drift to a strikingly similar population mix. What is unique about this mix?
Is it a stable point (or attractor)? Further simulations and theoretical analysis are
needed to resolve this question.

2. All homogeneous populations drift to a population mix of only +V2 languages. 
Thus, the V2 parameter is gradually set over succeeding generations by all people in 
the community (irrespective of which language they speak). In other words, there is 
as before a tendency to gain V2 rather than lose it (we emphasize again, that this is 
contrary to linguistic intuition). 

Fig. 5-44 shows the changing percentage of the population speaking the different
languages starting off from a homogeneous population speaking $L_5$. As before, learners
who have not converged to the target in 128 examples are the driving force for change
here. Note again the time evolution of the grammars. For about 5 generations there is
only a slight decrease in the percentage of speakers of $L_5$. Then the linguistic patterns
switch over the next 7 generations to a relatively stable mix.

Figure 5-44: Time evolution of grammars using greedy algorithm with no single value.

Initial Language              Change to Language?
Any Language (Homogeneous)    1 (0.11), 2 (0.16), 3 (0.10), 4 (0.14),
                              5 (0.12), 6 (0.14), 7 (0.10), 8 (0.13)

Table 5.5: Language change driven by misconvergence. A finite-sample analysis was
conducted allowing each child learner (following 1) random walk and 2) the TLA with
greediness dropped) 128 examples to internalize its grammar. Initial populations were
linguistically homogeneous, and they drifted to different linguistic compositions. The
major language groups after 30 generations have been listed in this table. Notice,
again, how all initially homogeneous populations tend to the same composition.

$\mathcal{A}$ = a) R.W. b) S.V. only; $P_i$ = Uniform; Finite Sample = 128

Here we simulated the evolution of the dynamical systems corresponding to two algo- 
rithms, both of which have no greediness constraint. The two algorithms are 1) the 
random walk described in the previous chapter and 2) TLA with single-value retained 
but no greediness constraint. 

In both cases, the population mix after 30 generations is the same, irrespective of
the initial language of the homogeneous population. This is shown in table 5.5.

Observations:

1. The first striking observation is that both algorithms yield dynamical systems
which arrive at the same population mix after 30 generations. The path by which
they arrive at this mix is, however, not the same (see fig. 5-45).

this proposes that linguistic replacements follow an S-shaped curve in time. In Bailey's
own words (taken from Kroch, 1990)
own words (taken from Kroch, 1990) 

A given change begins quite gradually; after reaching a certain point (say, 
twenty percent), it picks up momentum and proceeds at a much faster 
rate; and finally tails off slowly before reaching completion. The result is 
an S-curve: the statistical differences among isolects in the middle relative 
times of the change will be greater than the statistical differences among 
the early and late isolects. 

The idea that linguistic changes follow an S-curve has been proposed by Osgood 
and Sebeok (1954), Weinreich, Labov, and Herzog (1968). More specific logistic forms 
were proposed by Altmann (1983), and Kroch (1982, 1989). The idea of the logistic
functional form is borrowed from population biology where it is demonstrable that
the logistic governs the replacement of organisms and of genetic alleles that differ in
Darwinian fitness. However, Kroch concedes that "unlike in the population biology
case, no mechanism of change has been proposed from which the logistic form can be
deduced".
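
For reference, the logistic form mentioned here is commonly written as follows. The symbols are notation assumed for this note rather than taken from the works cited: $p(t)$ is the fraction of the new form in use at time $t$, $s$ is a slope (rate) constant, and $k$ fixes the position of the curve in time.

$$p(t) = \frac{e^{k + st}}{1 + e^{k + st}}, \qquad \frac{dp}{dt} = s\, p(t)\, \bigl(1 - p(t)\bigr)$$

The differential form makes the S-shape explicit: change is slow when $p$ is near 0 or 1 and fastest at $p = 1/2$.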

Crucially, in our case, we suggest a specific acquisition-based model of language
change. The combination of grammatical theory, learning algorithms, and distributional
assumptions on sentences drives change; the specific form of the change (which
might or might not be S-shaped, and might have varying rates) is thus a derivative of
more fundamental assumptions. This is in contrast with the above-mentioned theories
of change.

The effect of maturational time

One obvious factor influencing the evolutionary trajectories is the maturational
time, i.e., the number ($N$) of sentences the child is allowed to hear before forming
its mature hypothesis. This was kept at 128 in all the systems shown so far. Fig. 5-46
shows the effect of $N$ on the evolutionary trajectories. As usual, we plot only
a subspace of the population. In particular, we plot the percentage of $L_2$ speakers
in the population with each succeeding generation. The initial composition of the
population was homogeneous (with people speaking $L_1$). It is worthwhile to make a
few observations:

1. The initial rate of change of the population is highest for the situation where 
the maturational time is the least, i.e., the learner is allowed the least amount 
of time to develop its mature hypothesis. This is hardly surprising. If the 
learner were allowed access to a lot of examples to make its mature hypothesis, 
most of the learners would have reached the target grammar. Very few would 

The effect of sentence distributions P_i

Another important factor influencing the evolutionary trajectories is the
distribution P_i with which sentences of the ith language, L_i, are presented to the
learner. In a certain sense, the grammatical space and the learning algorithm
determine the order of the dynamical system. The sentence distributions, on the
other hand, are like the parameters of the dynamical system (we comment on this
point later). Clearly the sentence distributions affect rates of convergence within
one generation, as we saw in the previous chapter. Further, by putting greater
weight on certain word forms rather than others, they might influence the systemic
evolution in certain directions.

To illustrate this idea, we consider an example. We study the interaction between
L_1 and L_2 speakers in the community as the sentence distributions with which these
speakers produce sentences change. Recall that so far, we have assumed that all
speakers produce sentences with uniform distributions on degree-0 sentences of the
language. Now, we consider an alternative distribution as below:

1. Let L_{1,2} = L_1 ∩ L_2.

2. P_1: Speakers of L_1 produce sentences so that all degree-0 sentences of L_{1,2}
are equally likely and their total probability is p. Further, sentences of
L_1 \ L_{1,2} are also equally likely, but their total probability is 1 - p.

3. P_2: Speakers of L_2 produce sentences so that all degree-0 sentences of L_{1,2}
are equally likely and their total probability is p. Further, sentences of
L_2 \ L_{1,2} are also equally likely, but their total probability is 1 - p.

4. Other P_i's are all uniform in degree-0 sentences.
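
The construction in items 1 to 3 can be sketched in a few lines of Python. This is
only an illustration under stated assumptions: the function name
`shared_weight_distribution` and the two toy sentence sets are hypothetical and are
not the degree-0 sentences of the three-parameter system; only the way the
probability mass p is split follows the definition above.

```python
from fractions import Fraction


def shared_weight_distribution(own, other, p):
    """Distribution on the sentences of `own` with total mass p on own ∩ other.

    Sentences inside the intersection are equally likely (total mass p);
    sentences outside it are equally likely (total mass 1 - p).
    """
    shared = sorted(own & other)
    private = sorted(own - other)
    if not shared or not private:
        raise ValueError("both the shared and the private part must be non-empty")
    p = Fraction(p)
    dist = {s: p / len(shared) for s in shared}
    dist.update({s: (1 - p) / len(private) for s in private})
    return dist


# Hypothetical toy languages (placeholders, not the thesis data).
L1 = {"S V", "S V O", "O V S", "Adv S V"}
L2 = {"S V", "S V O", "Adv V S", "V S O", "O V S X"}

p = Fraction(3, 4)
P1 = shared_weight_distribution(L1, L2, p)  # distribution used by L1 speakers
P2 = shared_weight_distribution(L2, L1, p)  # distribution used by L2 speakers
```

Exact fractions are used so that the two probability masses can be checked without
rounding; floating point values would serve equally well in a simulation.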

Thus, the distributions P_i are parameterized by a single parameter, p, which
determines the amount of measure on the sentence patterns in common between the
languages L_1 and L_2. Fig. 5-47 shows the evolution of the L_2 speakers as p varies.
The learning algorithm used was the TLA, and the initial population was homogeneous
(speaking language L_1). Thus, the initial percentage of L_2 speakers in the
community was 0. Notice how the system moves in different ways as p varies. When p
is very small (0.05), i.e., strings common to L_1 and L_2 occur infrequently, the
long term implication is that L_2 speakers do not grow in the community. As p
increases, more strings of L_2 occur, and the system is driven to increase the number
of L_2 speakers until p = 0.75, when the population evolves into a completely L_2
speaking community. After this, as p increases further, we notice (see p = 0.95)
that the L_2 speakers increase but can never rise to 100 percent of the population;
there is still a residual L_1 speaking component. This is natural, because for such
high values of p, strings common to L_1 and L_2 occur all the time. This means that
the learner could converge to L_1 just as well, and some learners indeed begin to do
so, increasing the number of L_1 speakers.

Figure 5-47: The evolution of L_2 speakers in the community for various values of p
(a parameter related to the sentence distributions P_i; see text). The algorithm used
was the TLA; the initial population was homogeneous, speaking only L_1. The curves
for p = 0.05, 0.75, and 0.95 have been plotted as solid lines.

This example shows us that if we wanted a homogeneous L_1 speaking population
to move to a homogeneous L_2 speaking population, then by choosing our distributions
appropriately, we could drive the grammatical dynamical system in the appropriate
direction. This suggests another important application of our dynamical system
approach. We can work backwards, and examine the conditions needed to generate a
change of a certain kind. By checking whether such conditions could possibly have
existed in history, we can falsify a grammatical theory, or a learning paradigm. Note
that this example showed the effect of sentence distributions, and how to tinker with
them to obtain a desired evolution. One could, in principle, tinker with the
grammatical theory or the learning algorithm in the same fashion, leading to a
powerful new tool to aid the search for an adequate linguistic theory.

5.3.2 Non-homogeneous Populations: Phase-Space Plots

For our three-parameter system, we have been able to characterize the update rules 
for the dynamical systems corresponding to a variety of learning algorithms. Each 
such dynamical system has a specific update procedure according to which the states 
evolve, from some initial state. In the earlier section, we examined the evolutionary 
trajectories when the population was homogeneous. A more complete characterization 
of the dynamical system would be achieved by obtaining phase-space plots of this 
system. Such phase-space plots are pictures of the state-space S filled with trajectories 
obtained by letting the system evolve from various initial points (states) in the state 
space. 

Phase-Space Plots: Grammatical Trajectories 

We have described earlier the relationship between the state of the population in one
generation and the next. In our case, let Π denote an 8-dimensional vector variable
(state variable). Specifically, Π = (π_1, ..., π_8)' (with Σ_{i=1}^{8} π_i = 1), as
we discussed before. The following schema reiterates the chain of dependencies
involved in the update rule governing system evolution. The state of the population
at time t (in generations) allows us to compute the transition matrix T for the
Markov chain associated with the memoryless learner. Now, depending upon whether we
want 1) an asymptotic analysis or 2) a finite sample analysis, we compute 1) the
limiting behavior of T^m as m (the number of examples) goes to infinity (for an
asymptotic analysis), or 2) the value of T^N (where N is the number of examples after
which maturation occurs). This allows us to compute the new state of the population.
Thus Π(t + 1) = g(Π(t)), where g is a complex non-linear relation.

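$$\Pi(t) \Longrightarrow P \text{ on } \Sigma^{*} \Longrightarrow T \Longrightarrow T^{N} \ (\text{or } \lim_{m \to \infty} T^{m}) \Longrightarrow \Pi(t+1)$$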

If we choose a certain initial condition Π_1, the system will evolve according to the
above relation and one can obtain a trajectory of Π in the 8-dimensional space over
time. Each initial condition yields a unique trajectory and one can then plot these
trajectories, obtaining a phase-space plot. Now, each such trajectory corresponds to
a line in the 8-dimensional plane given by Σ_{i=1}^{8} π_i = 1. It is obviously not
possible to display such a high dimensional object, but we plot in fig. 5-48 the
projection of a particular trajectory onto a two dimensional subspace given by
(π_1(t), π_2(t)) (in other words, the proportion of speakers of L_1 and L_2) at
different points in time.

As mentioned earlier, with a different initial condition, we get a different
grammatical trajectory. The state space is thus filled with all the different
trajectories corresponding to different initial conditions. Fig. 5-49 shows this.

Figure 5-48: Subspace of a phase-space plot. The plot shows (π_1(t), π_2(t)) as t
varies, i.e., the proportion of speakers speaking languages L_1 and L_2 in the
population. The initial state of the population was homogeneous (speaking language
L_1). The algorithm used was the TLA with the single-value constraint dropped.

Issues of Stability 

We notice from the phase-space plots that many of the initial conditions yield trajec- 
tories which seem to converge to a point in the state space. In the dynamical systems 
terminology, this would correspond to a fixed point of the system. In other words, 
this is a population mix which would remain that way. Some natural questions arise 
at this stage. What are the conditions for stability? How many fixed points are 
there in the system? How do we solve for them? These are interesting questions but 
detailed answers are not within the scope of this thesis. We would like to state here 
a fixed point theorem which allows us to characterize the stable population mixes. 

Figure 5-49: Subspace of a phase-space plot. The plot shows (π_1(t), π_2(t)) as t
varies for different initial conditions (non-homogeneous populations). The algorithm
used by the learner is the TLA with the single-value constraint dropped.

First, some notational preliminaries. As before, let P_i be the distribution on the
sentences of the ith language L_i. From P_i, we can construct T_i, the transition
matrix whose elements are given by the explicit procedure documented in the previous
chapter. This matrix, T_i, models the behavior of the TLA learner if the target
language was L_i (with sentences from the target produced with P_i). Similarly, one
can obtain the matrices for variants of the TLA. Note that fixing the P_i's fixes the
T_i's, and these can be considered to be the parameters (see footnote 38) of the
dynamical system. If the state of the (parental) population at time t is Π(t), then
it is possible to show that the (true) transition matrix of the TLA (or TLA-like)
learner is T = Σ_{i=1}^{8} π_i(t) T_i. For the finite case analysis, the following
theorem holds:

Theorem 5.3.1 (Finite Case) A fixed point (stable point) of the grammatical
dynamical system (obtained by a TLA-like learner operating on the 8 parameter space
with k examples to choose its mature hypothesis) is a solution of the following
equation:

$$\Pi' = (\pi_1, \ldots, \pi_8) = (1, \ldots, 1)' \left( \sum_{i=1}^{8} \pi_i T_i \right)^{k}$$

Proof (Sketch): This equation is obtained simply by setting Π(t + 1) = Π(t). Note,
however, that this is an example of a non-linear multi-dimensional iterated function
map. The analysis of such a dynamical system is quite non-trivial, and our theorem
by no means captures all the possibilities. ∎



38 There might be some confusion about the two different notions of parameters
floating around. To clarify further: we have n linguistic parameters which define the
2^n languages and define the state-space of the system. We also have the P_i's, which
characterize the way in which the system evolves and are therefore the parameters of
the complete grammatical dynamical system.

We can similarly state a theorem for the limiting case analysis.

Theorem 5.3.2 (Limiting Analysis) A fixed point (stable point) of the grammatical
dynamical system (obtained by a TLA-like learner operating on the 8 parameter
space, given infinite examples to choose its mature hypothesis) is a solution of the
following equation:

$$\Pi' = (\pi_1, \ldots, \pi_8) = (1, \ldots, 1)' \left( I - \sum_{i=1}^{8} \pi_i T_i + \mathrm{ONE} \right)^{-1}$$

where ONE is the 8 x 8 matrix with all its entries equal to 1.

Proof: Again this is trivially obtained by setting Π(t + 1) = Π(t). The expression on
the right provides an analytical expression for the update equation in the asymptotic
case. See Resnick (1992) for details. All the caveats mentioned in the proof section
of the previous theorem apply here as well. ∎
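
The linear-algebra step behind the limiting expression can be checked on a small
example. If a row vector π satisfies πT = π and sums to 1, then
π(I - T + ONE) = π - π + (1, ..., 1) = (1, ..., 1), so π = (1, ..., 1)(I - T + ONE)^{-1}
whenever the inverse exists. The sketch below verifies this for a fixed, hypothetical
3-state chain; it does not solve the full fixed-point problem of the theorem, in
which T itself depends on π. The helper names `solve` and `limiting_state` are
assumptions introduced for the illustration.

```python
from fractions import Fraction


def solve(A, b):
    """Solve A x = b exactly by Gauss-Jordan elimination over the rationals."""
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [v - factor * w for v, w in zip(M[r], M[col])]
    return [row[n] for row in M]


def limiting_state(T):
    """Row vector (1, ..., 1)(I - T + ONE)^{-1} for a fixed stochastic matrix T."""
    n = len(T)
    A = [[(1 if i == j else 0) - T[i][j] + 1 for j in range(n)] for i in range(n)]
    # x A = 1' is equivalent to A' x' = 1, so solve with the transpose of A.
    A_transposed = [list(col) for col in zip(*A)]
    return solve(A_transposed, [1] * n)


# Hypothetical 3-state chain, chosen so that the answer is easy to verify by hand.
T = [
    [Fraction(1, 2), Fraction(1, 2), Fraction(0)],
    [Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)],
    [Fraction(0), Fraction(1, 2), Fraction(1, 2)],
]
pi = limiting_state(T)
pi_T = [sum(pi[i] * T[i][j] for i in range(3)) for j in range(3)]
print(pi)  # [Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)]
```

For the grammatical dynamical system, one would apply the same expression with
T = Σ π_i T_i and then look for a π that reproduces itself.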

Remark: We have just scratched the surface as far as the theoretical characterization
of these grammatical dynamical systems is concerned. The main purpose of this
chapter is to show that these dynamical systems exist as a logical consequence of 
assumptions about the grammatical space, and a learning theory. We have demon- 
strated some preliminary simulations with these systems. From a theoretical per- 
spective, it would be very interesting to better understand such systems. Strogatz 
(1993) suggests that non-linear multidimensional (more than 3 dimensions) mappings 
are likely to be chaotic. Such investigations are beyond the scope of this thesis, and 
might be a fruitful area for further research. 

5.4 Example 2: The Case of Modern French

The previous example considered a 3-parameter system for which we derived several 
different dynamical systems. Our goal was to concretely instantiate our philosophical 
arguments in sections 2 and 3, and provide a flavor of the many different factors which 
influence the evolution of these grammatical dynamical systems. In this section, we 
briefly consider a different parametric system (studied by Clark and Roberts, 1993). 
The historical context in which we study this is the evolution of Modern French from 
Old French. 

Extensive simulations in the earlier section reveal that while the learnability
problem of the 3-parameter space can be solved by stochastic hill climbing
algorithms, the long term evolution of these algorithms has a behavior which is at
variance with the diachronic change actually observed in historical linguistics. In
particular, we saw how there was a tendency to gain rather than lose the V2 parameter
setting. While this could be an artifact of the class of learning algorithms
considered, a more likely explanation is that loss of V2 (observed in many of the
world's languages, such as French) is due to an interaction of parameters and
triggers other than that considered in the previous section. We investigate this
possibility and begin by first providing the parametric theory.

5.4.1 The Parametric Subspace and Data 

We now consider a syntactic space involving the following 5 (Boolean-valued)
parameters. We do not attempt to describe these parameters. The interested reader
should consult Haegeman (1991) for details.

1. p_1: Case assignment under agreement (p_1 = 1) or not (p_1 = 0).

2. p_2: Case assignment under government (p_2 = 1) or not (p_2 = 0). Relevant
triggers for this parameter include "Adv V S", "S V O".

3. p_3: Nominative clitics.

4. p_4: Null Subject. Here relevant triggers would include "wh V S O".

5. p_5: Verb-second V2. Triggers include "Adv V S" and "S V O".

These 5 parameters now define a space of 32 parameterized grammars. Each
grammar in this parameterized system can be represented by a string of 5 bits
depending upon the values of p_1, ..., p_5. We obviously need to look at the surface
strings (sentences) generated by each such grammar. For the purpose of explaining how
Old French changed to Modern French over time, Clark and Roberts consider the
following sentences. We provide these sentences below. The parameter settings which
need to be made in order to generate each sentence are provided in brackets.

The Relevant Data:

adv V S [*1**1];
S V O [*1**1] or [1***0];
wh V S [*1***];
wh V s [**1**];
X (pro)V [*1*11] or [1**10];
X V s [**1*1];
X s V [**1*0];
X S V [1***0];
(s)VY [*1*11]

The parameter settings provided in brackets determine the grammars which generate
the sentence. Thus the sentence (adv V S; "quickly ran John", an incorrect word
order in English) is generated by all grammars which have case assignment under
government (p_2 = 1) and verb second movement (p_5 = 1). The other parameters can
be set to any value. Clearly there are 8 different grammars which can generate (parse)
this sentence. Similarly there are 16 grammars (8 corresponding to parameter settings
of [*1**1] and 8 corresponding to parameter settings of [1***0]) which generate
(S V O), and 4 grammars which generate ((s) V Y).
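
These counts can be confirmed by enumerating all 32 bit strings and testing each one
against the bracketed settings. The sketch below does this; the function names
`matches` and `generating_grammars` are hypothetical, and a grammar is written as a
string of five characters p_1 ... p_5, with "*" in a pattern standing for either
value.

```python
from itertools import product


def matches(grammar, pattern):
    """True if `grammar` agrees with `pattern` at every non-wildcard position."""
    return all(p in ("*", g) for g, p in zip(grammar, pattern))


def generating_grammars(patterns):
    """All 5-bit grammars that satisfy at least one of the given patterns."""
    grammars = ("".join(bits) for bits in product("01", repeat=5))
    return [g for g in grammars if any(matches(g, p) for p in patterns)]


adv_v_s = generating_grammars(["*1**1"])
s_v_o = generating_grammars(["*1**1", "1***0"])
s_v_y = generating_grammars(["*1*11"])
print(len(adv_v_s), len(s_v_o), len(s_v_y))  # 8 16 4
```

The two settings for S V O cannot overlap, since one requires p_5 = 1 and the other
p_5 = 0, which is why their counts simply add.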

Remark. Note that the set of sentences considered here is only a subset of the
total set of (degree-0) sentences generated by the 32 grammars in question. Clark
and Roberts have only considered this subset and attempted to construct learning
algorithms and models of diachronic change using genetic algorithms. In order to
facilitate direct comparison with their results, we have not attempted to expand the
data set or fill out the space any further. As a result, not all the grammars have
unique extensional properties, i.e., some generate the same sentences and are thus
equivalent.

5.4.2 The Case of Diachronic Syntax Change in French 

Within this parameter space, it is historically observed that the language spoken in
France underwent a parametric change from the twelfth century to modern times. In
particular, a loss of V2 and pro-drop is observed. We provide two examples of this.
In keeping with standard practice, the asterisk denotes an ungrammatical sentence.

Loss of null subjects: pro-drop

a. *Ainsi s'amusaient bien cette nuit. (Modern French)
   thus (they) had fun that night.

b. Si firent (pro) grant joie la nuit. (Old French)
   thus (they) made great joy the night.

Loss of V2

a. *Puis entendirent-ils un coup de tonnerre. (Modern French)
   then they heard a clap of thunder.

b. Lors oirent ils venir un escoiz de tonoire. (Old French)
   then they heard come a clap of thunder.

It has been argued that this transition was brought about by the introduction of
new word orders during the fifteenth and sixteenth centuries, resulting in generations
of children acquiring slightly different grammars and eventually culminating in the
grammar of Modern French. A brief reconstruction of the historical process (after
Clark and Roberts, 1993) is provided.

Old French [11011] The language spoken in the twelfth and thirteenth centuries had
verb-second movement and null subjects, both of which were dropped by the twentieth
century. The set of sentences generated by the parameter settings corresponding to
Old French are:

adv V S [*1**1];
S V O [*1**1] or [1***0];
wh V S [*1***];
X (pro)V [*1*11] or [1**10]

Note that from the data set, it appears that the Case agreement and nominative
clitics parameters remain ambiguous. In particular, Old French is in a
subset-superset relation with another language (generated by the parameter settings
of 11111). Clearly some kind of subset principle (Berwick, 1985) has to be used by
the learner, for otherwise it is not clear how the data would allow the learner to
converge to the Old French grammar in the first place. Note that TLA or TLA-like
schemes would not converge uniquely to the grammar of Old French.

The string (X)VS occurs with a frequency of 58% and SV(X) with a frequency of 34% in
Old French texts. It is argued that this frequency of (X)VS is high enough to cause
the V2 parameter to trigger to +V2.

Middle French. In Middle French, the data is not consistent with any of the 32
target grammars (equivalent to a heterogeneous population). Analysis of texts from
that period reveals that some old forms (like Adv V S) decreased in frequency and
new forms (like Adv S V) increased. It is argued in Clark and Roberts that such
a frequency shift causes "erosion" of V2, brings about parameter instability, and
ultimately leads to convergence to the grammar of Modern French. In this transition
period (i.e., when Middle French was spoken/written) the data is of the following
form:

adv V S [*1**1];
S V O [*1**1] or [1***0];
wh V S [*1***];
wh V s [**1**];
X (pro)V [*1*11] or [1**10];
X V s [**1*1];
X s V [**1*0];
X S V [1***0];
(s)VY [*1*11]

Thus, we have old sentence patterns like Adv V S (though it decreases in frequency
and becomes only 10%), S V O, X (pro)V and wh V S O. The new sentence patterns
which emerge at this stage are adv S V (increases in frequency to become 60%), X
subjclitic V, V subjclitic, (pro)V Y (null subjects), and wh V subjclitic O.

Modern French [10100] By the eighteenth century, French had lost both the V2
parameter setting as well as the null subject parameter setting. The sentence patterns
consistent with Modern French parameter settings are S V O [*1**1] or [1***0], X S V
[1***0], V s [**1**]. Note that this data, though consistent with Modern French,
will not trigger all the parameter settings. In this sense, Modern French (just like
Old French) is not uniquely learnable from data. However, as before, we shall not
concern ourselves overly with this, for the relevant parameters (V2 and null subject)
are uniquely set by the data here.

5.4.3 Some Dynamical System Simulations 

We can obtain dynamical systems for this parametric space, for a TLA (or TLA-like)
algorithm, in a straightforward fashion. We show the results of two simulations
conducted with such dynamical systems.
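
To make the construction concrete, the sketch below builds the 32-state transition
matrix of a TLA learner on this data set. It follows the usual statement of the TLA:
if the current grammar analyzes the incoming sentence the hypothesis is kept;
otherwise one parameter is chosen uniformly at random and flipped, and the flip is
retained only if the new grammar analyzes the sentence. The pattern table is
transcribed from the data listed earlier. The assumption that the source emits its
sentence patterns with equal probability is ours, made only for illustration, so the
resulting numbers should not be expected to reproduce the percentages reported
below. The function names are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Sentence patterns and the parameter settings that generate them
# ('*' stands for either value), transcribed from "The Relevant Data".
DATA = {
    "adv V S": ["*1**1"],
    "S V O": ["*1**1", "1***0"],
    "wh V S": ["*1***"],
    "wh V s": ["**1**"],
    "X (pro)V": ["*1*11", "1**10"],
    "X V s": ["**1*1"],
    "X s V": ["**1*0"],
    "X S V": ["1***0"],
    "(s)VY": ["*1*11"],
}
GRAMMARS = ["".join(bits) for bits in product("01", repeat=5)]
INDEX = {g: i for i, g in enumerate(GRAMMARS)}


def parses(grammar, sentence):
    """True if some bracketed setting for `sentence` agrees with `grammar`."""
    return any(
        all(p in ("*", g) for g, p in zip(grammar, pattern))
        for pattern in DATA[sentence]
    )


def flip(grammar, k):
    """Return `grammar` with its k-th parameter toggled."""
    return grammar[:k] + ("0" if grammar[k] == "1" else "1") + grammar[k + 1:]


def tla_matrix(sentence_probs):
    """Transition matrix of a TLA learner fed sentences from `sentence_probs`."""
    n = len(GRAMMARS)
    T = [[Fraction(0)] * n for _ in range(n)]
    for g in GRAMMARS:
        i = INDEX[g]
        for sentence, prob in sentence_probs.items():
            if parses(g, sentence):
                T[i][i] += prob              # sentence analyzed: keep hypothesis
                continue
            for k in range(5):               # choose one parameter uniformly
                h = flip(g, k)
                j = INDEX[h] if parses(h, sentence) else i  # keep flip only if it helps
                T[i][j] += prob / 5
    return T


# Assumed source: Old French (11011), uniform over the patterns it generates.
old_french = "11011"
patterns = [s for s in DATA if parses(old_french, s)]
source = {s: Fraction(1, len(patterns)) for s in patterns}
T = tla_matrix(source)
```

A mixed source, such as the Old French and Modern French mixture considered later,
can be handled by passing a `sentence_probs` dictionary that is the corresponding
weighted combination of the two sources.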

Figure 5-50: Evolution of speakers of different languages in a population starting off
with speakers only of Old French.



Homogeneous Populations [Initial: Old French]

We conducted a simulation on this new parameter space using the Triggering Learning
Algorithm. Recall that the relevant Markov chain in this case has 32 states. We
start the simulation with a homogeneous population speaking Old French (parameter
setting = 11011). Our goal was to see if misconvergence alone could drive Old French
to Modern French.

Just as before, we can observe the linguistic composition of the population over
several generations. It is observed that in one generation, 15 percent of the children
converge to grammar 01011; 18 percent to grammar 01111; 33 percent to grammar
11011 (target) and 26 percent to grammar 11111, with very few having converged to
other grammars. Thereafter, the population consists mostly of speakers of these 4
languages, with one important difference: 15 percent of the speakers eventually lose
V2. In particular, they have acquired the grammar 11110. Shown in fig. 5-50 is
the percentage of the population speaking the 4 languages mentioned above as they
evolve over 20 generations. Notice that in the space of a few generations, the speakers
of 11011 and 01011 have dropped out altogether. Most of the population now speaks
language 11111 (46 percent) and 01111 (27 percent). Fifteen percent of the population
speaks 11110 and there is a smattering of other speakers. The population remains
roughly stable in this configuration thereafter.

Observations:

1. On examining the four languages to which the system converges after one gener- 
ation, we notice that they share the same settings for the principles [Case assignment 
under government], [pro drop], and [V2]. These correspond to the three parameters 
which are uniquely set by data from Old French. The other two parameters can take 
on any value. Consequently 4 languages are generated all of which satisfy the data 
from Old French. 

2. Recall our earlier remark that due to insufficient data, there were equivalent 
grammars in the parameter system. It turns out that in this particular case, the 
grammars (01011) and (11011) are identical as far as their extensional properties are 
concerned; as are the grammars (11111) and (01111). 

3. There is a subset relation between the two sets described in (2). The grammar
(11011) is in a subset relation with (11111). This explains why after a few generations
most of the population switches to either (11111) or (01111) (the superset grammars).

4. An interesting feature of the simulation is that 15 percent of the population
eventually acquires the grammar (11110), i.e., they have lost the V2 parameter setting.
This is the first sign of instability of V2 that we have seen in our simulations so far
(for greedy algorithms which are psychologically preferred). Recall that for such
algorithms, the V2 parameter was very stable in our previous example.
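
Observations 2 and 3 can be checked directly against the data set by computing, for
each grammar, the set of sentence patterns it generates. The sketch below does so;
the pattern table is transcribed from the data listed earlier, and the function name
`extension` is hypothetical.

```python
# Sentence patterns and the parameter settings that generate them
# ('*' stands for either value), transcribed from "The Relevant Data".
DATA = {
    "adv V S": ["*1**1"],
    "S V O": ["*1**1", "1***0"],
    "wh V S": ["*1***"],
    "wh V s": ["**1**"],
    "X (pro)V": ["*1*11", "1**10"],
    "X V s": ["**1*1"],
    "X s V": ["**1*0"],
    "X S V": ["1***0"],
    "(s)VY": ["*1*11"],
}


def extension(grammar):
    """Set of sentence patterns generated by a 5-bit grammar string."""
    return frozenset(
        sentence
        for sentence, patterns in DATA.items()
        if any(all(p in ("*", g) for g, p in zip(grammar, pat)) for pat in patterns)
    )


same_small = extension("01011") == extension("11011")   # observation 2
same_large = extension("11111") == extension("01111")   # observation 2
is_subset = extension("11011") < extension("11111")     # observation 3 (proper subset)
extra = sorted(extension("11111") - extension("11011"))
print(same_small, same_large, is_subset)  # True True True
print(extra)  # ['X V s', 'wh V s']
```

On this data set, the two additional patterns available to the superset grammars are
the ones that require nominative clitics (p_3 = 1).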

Heterogeneous Populations (Mixtures) 

The earlier section showed that with no new (foreign) sentence patterns, the
grammatical system starting out with only Old French speakers showed some tendency to
lose V2. However, the grammatical trajectory did not terminate in Modern French.
In order to more closely duplicate this historically observed trajectory, we examine
alternative initial conditions. We start our simulations with an initial condition
which is a mixture of two sources: data from Old French and data from Modern French
(reproducing, in this sense, data similar to that obtained from the Middle French
period). Thus children in the next generation observe new surface forms. Most of the
surface forms observed in Middle French are covered by this mixture.

Observations:

1. On performing the simulations using the TLA as a learning algorithm on this
parameter space, an interesting pattern is observed. Suppose the learner is exposed to
sentences of which 90 percent are generated by the Old French grammar (11011) and 10
percent by the Modern French grammar (10100). Within one generation, 22 percent of the
learners have converged to the grammar (11110) and 78 percent to the grammar (11111).
Thus the learners set each of the parameter values to 1 except the V2 parameter
setting. Now Modern French is a non-V2 language, and 10 percent of data from
Modern French is sufficient to cause 22 percent of the speakers to lose V2. This is the
behavior over one generation. The new population (consisting of 78 percent speaking
grammar (11111) and 22 percent speaking grammar (11110)) remains stable forever.

Figure 5-51: Tendency to lose V2 as a result of new word orders introduced by the
Modern French source in our Markov model.

2. Fig. 5-51 shows the proportion of speakers who have lost V2 after one gener- 
ation, as a function of the proportion of sentences from the Modern French Source. 
The shape of the curve is interesting. For small values of the proportion of the Mod- 
ern French source, the slope of the curve is greater than 1. Thus there is a greater 
tendency of speakers to lose V2 than to retain it. Thus 10 percent of novel sen- 
tences from the Modern French source causes 20 percent of the population to lose 
V2; similarly 20 percent of novel sentences from the Modern French source causes 40 
percent of the speakers to lose V2. This effect wears off later. This seems to capture 
computationally the intuitive notion of many linguists that a small change in inputs 
provided to children could drive the system towards larger change. 

3. Unfortunately, there are several shortcomings of this particular simulation.
First, we notice that mixing Old and Modern French sources does not cause the
desired (historically observed) grammatical trajectory from Old to Modern French
(corresponding in our system to movement from state (11011) to state (10100) in our
Markov chain). Although we find that a small injection of sentences from Modern
French causes a larger percentage of the population to lose V2 and gain subject
clitics (which are historically observed phenomena), nevertheless, the entire
population retains the null subject setting and case assignment under government. It
should be mentioned that Clark and Roberts argue that the change in case assignment
under government is the driving force which allows alternate parse-trees to be formed
and causes the parametric loss of V2 and null subject. In this sense, it is a more
fundamental change.

4. If the dynamical system is allowed to evolve, it ends up in either of the two 
states (11111) or (11110). This is essentially due to the subset relations these states 
(languages) have with other languages in the system. Another complication in the 
system is the equivalence of several different grammars (with respect to their surface 
extensions) e.g. given the data we are considering, the grammars (01011) and (11011) 
(Old French) generate the same sentences. This leads to multiplicity of paths, con- 
vergence to more than one target grammar and general inelegance of the state-space 
description. 

Future Directions: There are several possibilities to consider here.

1. Using more data and filling out the state-space might yield greater insight. 
Note that we can also study the development of other languages like Italian or Spanish 
within this framework and that might be useful. 

2. TLA-like hill climbing algorithms do not pay attention to the subset princi- 
ple explicitly. It would be interesting to explicitly program this into the learning 
algorithm and observe the evolution thereafter. 

3. There are often cases when several different grammars generate the same
sentences or at least fit the data equally well. Algorithms which look only at surface
strings are then unable to distinguish between them, resulting in convergence to all
of them with different probabilities in our stochastic setting. We saw an example of
this for convergence to four states earlier. Clark and Roberts suggest an elegance
criterion by looking at the parse-trees to decide between these grammars. This
difference between strong generative capacity and weak generative capacity can easily
be incorporated into the Markov model as well. The transition probabilities, now,
will not depend upon the surface properties of the grammars alone, but also upon
the elegance of derivation for each surface string.

4. Rather than the evolution of the population, one could look at the evolution of 
the distribution of words. One can also obtain bounds on frequencies with which the 
new data in the Middle French Period must occur so that the correct drift is observed. 

5.5 Conclusions

In this chapter, we have argued that any combination of (grammatical theory, learning 
paradigm) leads to a model of grammatical evolution and diachronic change. A 
learning theory (paradigm) attempts to account for how children (the individual child) 
solve the problem of language acquisition. By considering a population of such "child 
learners", we have arrived at a model of the emergent, global, population behavior. 
The key point is that such a model is a logical consequence of grammatical, and 
learning theories. Consequently, whenever a linguist suggests a new grammatical, or 
learning theory, they are also suggesting a particular evolutionary theory — and the 
consequences of this need to be examined. 

Historical Linguistics and Diachronic Criteria 

From a programmatic perspective, this chapter has two important consequences.
First, it allows us to take a formal, analytic view of historical linguistics. Most
accounts of language change have tended to be descriptive in nature (though
significant exceptions are the work of Lightfoot, Kroch, Clark and Roberts, among
others). In contrast, we place the study of historical linguistics (diachronic
phenomena) on a scientific (footnote 39) platform. In this sense, our conception of
historical linguistics is closest in spirit to evolutionary theory and population
biology (footnote 40), which attempt to describe the origin and changing patterns of
life, and cosmology, which attempts to describe the origin and evolution of the
physical universe.

Second, it allows us to formally pose a diachronic criterion for the adequacy of
grammatical theories. A significant body of work in learning theory has already
sharpened the learnability criterion for grammatical theories; in other words, the
class of grammars Q must be learnable by some psychologically plausible algorithm
from primary linguistic data. Now we can go one step further. The class of grammars
Q (along with a proposed learning algorithm A) can be reduced to a dynamical
system whose evolution must match the true evolution of human languages
(as reconstructed from historical data).



39 By scientific, we mean the construction of models with explanatory and predictive
powers: models which can be falsified in the sense of Popper.

40 Indeed, most previous attempts to model language change, like those of Clark and
Roberts (1993) and Kroch (1990), have been influenced by evolutionary models.

In This Chapter

In this chapter, we have attempted to lay the framework for the development of 
research tools to study historical phenomena. To concretely demonstrate that the 
grammatical dynamical systems need not be impossibly difficult to compute (or sim- 
ulate), we explicitly showed how to transform parameterized theories, and memoryless 
learning algorithms into dynamical systems. The specific simulations of this chapter 
are far too incomplete to have any long-term linguistic implications, though we hope 
they form a starting point for research in this direction. Nevertheless, certain 
interesting results were obtained in this chapter. 

1. We saw that the V2 parameter was more stable in the 3-parameter case 
than it was in the 5-parameter case. This suggests that the loss of V2 (actually 
observed in history) might have more to do with the choice of parameterization than 
with learning algorithms or primary linguistic data (though we suggest great caution 
before drawing strong conclusions on the basis of this study). 

2. We were able to shed some light on the time course of evolution. In particular, 
we saw how this was a derivative of more fundamental assumptions about initial 
population conditions, sentence distributions, and learning algorithms. 

3. We were able to formally develop notions of system stability. Thus, certain 
parameters could change with time, while others might remain stable. This can now be 
measured, and the conditions for stability or change can be investigated. 

4. We were able to demonstrate how one could tinker with the system (by changing 
the algorithm, or the sentence distributions, or maturational time) to allow evolution 
in certain directions. This would suggest the kinds of changes needed in linguistics 
for greater explanatory adequacy. 

Further Research 

This has been our first attempt to define the boundaries of the problem. There are 
several directions of further research. 

1. From a linguistic perspective, the most interesting thing to do would perhaps 
be to examine alternative parameterized theories, and to track the change of 
certain languages in the context of these theories (much like our attempt to track 
the change of French in this chapter). Some worthwhile attempts would include a) 
the study of parametric stress systems (Halle and Idsardi, 1992), and in particular, 
the evolution of modern Greek stress patterns from Proto-Indo-European; b) the 
investigation of the possibility that Creoles correspond to fixed points in parametric 
dynamical systems, a possibility which might explain the striking fact that all Creoles 
(irrespective of the linguistic origin, i.e., initial linguistic composition of the 
population) have the same grammar; c) the evolution of modern Urdu, with Hindi syntax 
and Persian vocabulary. 

2. From a mathematical perspective, one could take this research in many directions, 
including a) the formalization of the update rule for other grammatical theories 
and learning algorithms, and the characterization of the dynamical systems implied 
therein; b) a closer investigation of stability issues, and a better characterization of 
the phase-space plots; c) an investigation of possible chaotic behavior: recall that our 
dynamical systems are multi-dimensional non-linear iterated function mappings, a 
recipe for such behavior. 

It is our hope that research in this line will mature to make useful contributions, 
both to linguistics, and in view of the unusual nature of the dynamical systems 
involved, to the study of such systems from a mathematical perspective. 
Chapter 6 
Conclusions 



Abstract 

This chapter concludes our thesis by articulating the perspective which emerges over the investiga- 
tions of the previous chapters. We discuss the implications of some of our specific results, their role 
in illuminating our point of view, and directions for future research. 

In this thesis, we investigated the problem of learning from examples. Implicit in 
any scientific investigation is a certain point of view — crucial to our point of view 
were: 

1. the belief (and recognition) that the brain computes functions (input-output 
maps). Consequently, a function approximation framework is relevant, and in 
the context of learning, it is of some value to understand the complexity of 
learning to approximate (or identify) functions from examples. 

2. a focus on the informational complexity of learning such functions. Roughly 
speaking, if one wishes to learn from examples, then how many examples does 
one need? 

From this starting point, we proceeded to examine the informational complexity 
of learning from examples in a number of different contexts. Several themes have 
emerged over the course of this thesis. 

6.1 Emergent Themes 

Hypothesis Complexity: The number of examples needed depends upon the com- 
plexity of the hypothesis class used. In view of Vapnik and Chervonenkis' work (and 
numerous other works in statistics), this is reasonably well recognized (though people 
continue to flout this in the design of underconstrained models for learning systems). 
More crucially, there is an inherent tension between the approximation error and 
estimation error, and a tradeoff between the two is involved whenever one chooses a 
model of a certain complexity. We demonstrated this explicitly in the case of feed- 
forward regularization networks, but the point is general. In language learning, this 
tension plays a crucial role, and guided our choice of the kinds of linguistic theo- 
ries worth examining from a scientific perspective. We later investigated the sample 
complexity of learning within the principles and parameters framework of modern 
linguistics. It is worthwhile to observe that within this framework as well, the model 
complexity can be measured in some fashion, e.g. the number, and nature of the 
principles (parameters), the Kolmogorov complexity of the grammatical class, and 
so on. The exact nature of the relationship between this model complexity, and the 
number of examples needed to learn was not investigated explicitly, and remains an 
important area for further research. 
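
The tension described above can be written schematically. The notation here is introduced only for illustration and is an assumption of this sketch: $f_0$ is the target, $f_n$ is the best approximant to $f_0$ within a hypothesis class $H_n$ of complexity $n$, and $\hat{f}_{n,l}$ is the hypothesis chosen from $H_n$ on the basis of $l$ examples. For any norm, the triangle inequality then gives:

```latex
\| f_0 - \hat{f}_{n,l} \|
  \;\le\;
  \underbrace{\| f_0 - f_n \|}_{\text{approximation error}}
  \;+\;
  \underbrace{\| f_n - \hat{f}_{n,l} \|}_{\text{estimation error}}
```

For nested classes $H_1 \subseteq H_2 \subseteq \dots$ the first term cannot increase as $n$ grows, whereas for a fixed number of examples $l$ the second term typically grows with $n$; choosing a model of a certain complexity amounts to balancing the two.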

Manner and Nature of Examples: The informational complexity of learning from 
examples clearly depends upon the nature of the examples, and the manner in which they 
are provided to the learner. In every case we have treated in this thesis, examples 
were (x, y) pairs consistent with some target function. There were slight differences, 
however, between the specific instances examined in the different chapters. For the 
case of regularization networks, these examples were contaminated with noise. For 
the case of languages investigated later, only positive examples were presented (i.e., 
all examples had the y-value of 1). 

A more interesting observation to make on the question of examples is our in- 
herently stochastic formulation of the problem. Examples were typically randomly 
drawn. This was according to some unknown distribution for regularization networks, 
and the language learner; and according to some known distribution in the case of the 
active function approximator (learner) of chapter 3. Such a stochastic formulation is 
very much in keeping with the spirit of PAC learning, which has influenced much of 
this work. Furthermore, it allows us to take recourse to laws of large numbers, and 
better characterize rates of convergence, and thereby sample complexity. 
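
As a small illustration of how a stochastic formulation turns a law of large numbers into a sample-complexity estimate, the sketch below uses Hoeffding's inequality for a single fixed hypothesis whose loss lies in [0, 1]. This is a standard textbook bound and not one of the bounds derived in this thesis; the function name `hoeffding_sample_size` is hypothetical.

```python
import math


def hoeffding_sample_size(epsilon, delta):
    """Examples sufficient for |empirical error - true error| <= epsilon
    with probability at least 1 - delta, for ONE fixed hypothesis with loss
    in [0, 1]. Obtained by solving 2 * exp(-2 * l * epsilon**2) <= delta for l.
    """
    if epsilon <= 0 or not (0 < delta < 1):
        raise ValueError("need epsilon > 0 and 0 < delta < 1")
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))


if __name__ == "__main__":
    for eps in (0.1, 0.05, 0.01):
        print(eps, hoeffding_sample_size(eps, delta=0.05))
```

The number of examples grows as 1/epsilon^2 but only logarithmically in 1/delta. Bounds that hold uniformly over an entire hypothesis class additionally depend on the complexity of that class.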

A few further observations need to be made. First, a stochastic formulation is 
not always utilized for investigation of learning paradigms. For example, in typical 
language learning research in an inductive inference setting, the learner is required 
to converge on every training sequence. The rates of convergence of such a learner 
are hard to characterize unless one puts a measure on the training sequences. This 
brings us back to a probabilistic framework, and indeed, such extensions have been 
considered in the past. Second, we observed that if examples are chosen by the 
learner (rather than passively drawn), the learner could potentially learn faster. Of course, 
this need not always be the case, and explicit formal studies are required to decide 
one way or the other. Third, the actual number of examples required (in a passive 
setting) depends upon the distribution with which data is presented to the learner. In 
the case of regularization networks, we were able to obtain distribution-free bounds, 
but these are only bounds, and as various researchers have noted, are often weak. For 
language learning in finite parameter cases, we see this dependence on distributions 
explicitly. We notice here that no distribution-free bound exists. 
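
To make the passive/active contrast concrete, here is a minimal sketch on a deliberately simple problem that is an assumption of this example, not one of the function classes analyzed in the thesis: the target is a step function on [0, 1] with an unknown threshold `theta`. The passive learner sees uniformly drawn points; the active learner queries the midpoint of the interval that is still consistent with its data.

```python
import random


def label(x, theta):
    """Hypothetical target: a step function with threshold theta."""
    return 1 if x >= theta else 0


def passive_uncertainty(theta, n, rng):
    """Width of the interval still consistent with n uniformly drawn examples."""
    lo, hi = 0.0, 1.0
    for _ in range(n):
        x = rng.random()
        if label(x, theta) == 1:
            hi = min(hi, x)
        else:
            lo = max(lo, x)
    return hi - lo


def active_uncertainty(theta, n):
    """Width of the consistent interval after n queries chosen by bisection."""
    lo, hi = 0.0, 1.0
    for _ in range(n):
        mid = (lo + hi) / 2.0
        if label(mid, theta) == 1:
            hi = mid
        else:
            lo = mid
    return hi - lo


if __name__ == "__main__":
    rng = random.Random(0)
    theta, n = 0.37, 20
    widths = [passive_uncertainty(theta, n, rng) for _ in range(200)]
    print("passive, mean width:", sum(widths) / len(widths))
    print("active, width:", active_uncertainty(theta, n))
```

In this toy case the active learner's uncertainty halves with every query, while the passive learner's shrinks only on the order of 1/n. As noted above, such an advantage is not guaranteed for every function class.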

The Learning Algorithm Used: The informational complexity of learning also 
depends upon the kind of algorithm the learner uses in making its hypotheses about 
the target. A poorly motivated algorithm might not even converge to the target (as 
the data goes to infinity), let alone do this in reasonable time. Again, this is not par- 
ticularly surprising, and the point becomes vacuous without explicit characterization 
of the relationship (between algorithm and sample complexity) in some form. The 
degree of constraint on the learning algorithms we examined varied from chapter 
to chapter. For the case of regularization networks, our results were valid for any 
algorithm which minimized a mean-square error term. Though we did not prove it 
in the thesis, it turns out that algorithms minimizing cross-entropy terms (for pattern 
classification) are covered in the analysis as well. In our investigation of active learn- 
ing, we considered the approximation scheme (a component of the learning algorithm) 
explicitly. As a matter of fact, all comparisons between passive and active methods 
of data collection were made between learners using the same approximation scheme 
(thereby eliminating the influence of the approximation scheme on sample complex- 
ity). Active and passive learners represent two significantly different kinds of learning 
algorithms. We saw how to derive an active scheme from a passive one in a function 
approximation setting, and how such a scheme could then potentially reduce the in- 
formational complexity of learning. For language learning, we were able to show that 
all memoryless algorithms could be modeled as a Markov chain. However, the tran- 
sition probabilities of these chains depended on the specific nature of the algorithms. 
We explicitly computed these transition matrices for a number of variants of the TLA 
(single step, greedy ascent) and showed how the sample complexity seemed to vary 
for the same task. It should also be noted that our analysis scheme in language learning 
(Markov chains) was derived from the learning algorithms used. In this sense, 
the sample complexity results were sharp (exact). This is in contrast to bounds on 
the sample complexity which can be obtained by using uniform convergence type 
arguments, or other techniques. 
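
The following sketch illustrates how a Markov chain model yields exact (rather than bounded) sample complexities. The transition matrix `T_TOY` is invented for illustration; in the analysis of the thesis the transition probabilities are derived from the specific learning algorithm and the sentence distribution. States are grammars, the target grammar is an absorbing state, and each example corresponds to one step of the chain.

```python
def step(dist, T):
    """One example: push the distribution over grammars through the chain."""
    n = len(T)
    return [sum(dist[i] * T[i][j] for i in range(n)) for j in range(n)]


def examples_needed(T, start, target, delta, max_examples=10000):
    """Smallest k with P(learner is at the target after k examples) >= 1 - delta."""
    dist = list(start)
    for k in range(max_examples + 1):
        if dist[target] >= 1.0 - delta:
            return k
        dist = step(dist, T)
    return None  # not reached within max_examples (e.g. another absorbing state)


# Hypothetical 3-grammar space; state 2 is the target and is absorbing.
T_TOY = [
    [0.7, 0.2, 0.1],
    [0.1, 0.6, 0.3],
    [0.0, 0.0, 1.0],
]

if __name__ == "__main__":
    start = [0.5, 0.5, 0.0]  # assumed: learner starts at a non-target grammar
    for delta in (0.5, 0.1, 0.01):
        print(delta, examples_needed(T_TOY, start, 2, delta))
```

Changing the learning algorithm changes the entries of the matrix, and therefore the number of examples needed for the same task; a second absorbing state would show up here as a `None` result for small `delta`.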

Learnability and Evolution: An important connection between learning systems 
and evolutionary systems emerged toward the end of this thesis. Both kinds of systems 
are adaptive ones. However, according to our analysis here, learning occurs 
at the level of the individual, and evolution at the level of the population. Clearly, the 
two interact, and an information-theoretic point of view is important for an 
understanding of such interactions. The manner, nature, and number of examples, the 
complexity of the hypothesis spaces, and the learning algorithms used have implications 
for global evolutionary trajectories of populations of learners. In this sense, a theory 
of learning which attempts to explain individual behavior logically implies a certain 
group behavior. We demonstrated this connection explicitly for the human language 
system. This kind of evolutionary analysis of learning systems could serve as an 
important research tool in a number of different contexts. Certainly, in economic sys- 
tems, one could examine the evolution (adaptation) of the global (macro) economy 
as a result of the behavior (also adaptive) of the individual economic agents. 
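
The step from individual learning to group behavior can be sketched for a deliberately simplified setting. Everything in this example is an assumption made for illustration: there are only two grammars, g1 and g2; a learner is memoryless and error-driven (it switches grammar whenever the current one cannot parse the incoming sentence); each learner hears `n_examples` sentences drawn from the adult population; and the population is large enough that the fraction of new g1 speakers equals the probability that a single learner ends up in g1.

```python
def prob_acquire_g1(p, a, b, n_examples):
    """Probability that one learner holds grammar g1 after n_examples sentences.

    p: fraction of the adult population speaking g1
    a: probability that a g1 speaker's sentence is parsable only by g1
    b: probability that a g2 speaker's sentence is parsable only by g2
    """
    to_g2 = (1.0 - p) * b  # learner holding g1 hears an unambiguous g2 sentence
    to_g1 = p * a          # learner holding g2 hears an unambiguous g1 sentence
    q = 0.5                # assumed: initial grammar chosen uniformly at random
    for _ in range(n_examples):
        q = q * (1.0 - to_g2) + (1.0 - q) * to_g1
    return q


def evolve(p0, a, b, n_examples, generations):
    """Iterate the population map p -> prob_acquire_g1(p, ...) over generations."""
    trajectory = [p0]
    for _ in range(generations):
        trajectory.append(prob_acquire_g1(trajectory[-1], a, b, n_examples))
    return trajectory


if __name__ == "__main__":
    for p in evolve(p0=0.5, a=0.6, b=0.2, n_examples=20, generations=10):
        print(round(p, 4))
```

The learner's two-state Markov chain depends on the current population mix `p`, so iterating it from generation to generation gives a non-linear map on `p`. Varying `a`, `b`, or `n_examples` (the maturation time) changes the trajectory, which is the kind of dependence discussed above.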

6.2 Extensions 

The previous section described the broad results, and the emergent perspective of 
this thesis. With this perspective, one could proceed in several directions. 

1. Model Selection: At a fundamental level, one could examine the question of 
model selection in general. In the cases we considered, the models (family of 
functions, or hypothesis classes) were homogeneous (in fact, often parameterized), 
i.e., all functions in our hypothesis class had the same representation 
(as regularization networks, parameterized grammars, spline functions, etc.). 
The task of learning reduced primarily to the task of estimating the values of 
the parameters. What if we have qualitatively different kinds of models? Instead 
of choosing the best hypothesis $h \in \mathcal{H}$, as all our learning problems were 
posed, what if we were interested in choosing the best class $\mathcal{H} \in \mathcal{H}_{super}$? To 
make matters a little concrete, suppose one were interested not in choosing the 
best regularization network for a certain problem, but in deciding whether reg- 
ularization networks were best for that problem (other candidate models might 
include multi-layer perceptrons, or polynomials, etc.)? In the case of languages, 
one might be interested in choosing between bigrams, context-free grammars, 
parameterized theories, etc. How does one characterize the complexity of the 
superclass $\mathcal{H}_{super}$, whose individual elements are not functions but classes of 
functions (models)? For that matter, how does one measure the distance between 
two models, i.e., $d(\mathcal{H}_1, \mathcal{H}_2)$? This matter has been of some interest in recent 
times, as increased computational power has made it possible for researchers to 
literally "throw models at the data" in their frantic search for "good" ones. 
This is also a subject of interest to researchers in the field of data mining. 

2. Informational Complexity of Grammars: Another fruitful area of research is the 
informational complexity of learning grammars. As has been mentioned earlier, 
most language learning research tends to focus on the Gold paradigm of iden- 
tification in the limit, without due attention to the rates at which the learner 
attains the target. Given the arguments of "poverty of stimulus" invoked in 
the modern approach to linguistics, an informational perspective is bound to be 
of some value in choosing between alternate theories. For example, what is the 
sample complexity of learning bigrams, trigrams, lexical-functional grammars, 
metrical stress patterns, optimality theory etc.? How does it depend upon 
the algorithms used, noise, sentence distributions? What are psychologically 
plausible algorithms? What are "real" sentence distributions like? Quantita- 
tive answers to these questions would considerably aid the search for the right 
linguistic theory. One could also potentially decompose the language learning 
problem into approximation and estimation parts. For example, by analogy 
with our analysis of regularization networks, we can pose the following simple 
problem. Let $M_n$ be the class of all finite state grammars with at most $n$ states 
(analogous to $H_n$: networks with at most $n$ hidden units). Let $M = \bigcup_{n=1}^{\infty} M_n$. 
Let $\mathcal{M}$ (analogous to $\mathcal{F}$) be some class which can be approximated by $M$ (it 
could simply be $M$ itself). Let examples be sentences drawn from some target 
grammar $m \in \mathcal{M}$. Then how many states ($n$) must we have, and how many 
examples must we draw, so that with high confidence the learner's grammar 
(extensionally) is $\epsilon$-close to $m$? 

3. Evolutionary Systems: We argued, in chapter 5, that specific assumptions about 
linguistic theories and learning paradigms lead automatically to a model of 
language change. We were able to transform memoryless algorithms operating 
on finite parameter spaces into explicit dynamical systems. There are several 
interesting directions to pursue within this area of research. First, one could at- 
tempt to obtain similar dynamical systems corresponding to other assumptions 
about linguistic theories, and learning algorithms. Second, from a purely math- 
ematical perspective, it would be interesting to study the classes of dynamical 
systems motivated by linguistics. For example, we saw that the systems 
for finite parameter spaces were non-linear and multi-dimensional. The mathematical 
characterization of such systems is far from trivial. Finally, of course, 
one must attempt to put such evolutionary models to good scientific use, by 
validating them against real cases of language change. Such an enterprise will hopefully 
result in a mathematically productive, and scientifically fruitful study of 
historical linguistics. In general, the connection between individual learning, 
and group evolution is an interesting one. It can be studied in other contexts, 
and formal connections between learning theory, and evolutionary theory need 
to be developed further. 

4. Computational Complexity: This thesis focused almost exclusively on the number 
of examples needed so that the learner's hypothesis is close to the target. 
The computational complexity of choosing a hypothesis (once the requisite number 
of examples has been provided) is a matter of great importance, and was largely 
ignored in this thesis. For example, our main theorem in Chapter 2 assumes 
that the learner will be able to find the global minimum of the mean-square 
error term. In general, this problem, as we have noted, is likely to be NP-hard. 
Similarly, in Chapter 3, the active learner reduces the informational complexity 
at the cost of increasing the computational burden. For the cases we examined, 
an analytical solution to the sequential optimal recovery problem allowed us to 
obtain tractable solutions. In general, however, the complexity of solving the 
optimal recovery equations (and recovering the optimal point to sample at each 
stage) could well be intractably high. Further, in the case of language learning, 
we obtained sample complexity bounds which were tuned to specific algorithms 
known to be feasible, and psychologically plausible. These algorithms, of course, 
do not learn every possible parameterized space. The complexity of learning a 
parameterized space, in general, could well be NP-hard (scalability with respect 
to number of parameters, and examples). These are directions worth pursuing. 
After all, a truly realistic cognitively plausible theory of human learning should 
require not only a feasible number of examples, but should also have low com- 
putational (cognitive) burden. 